SUMMARY

The Air Force Human Resources Laboratory has recently launched an attack
on the problems associated with producing a meaningful criterion measure
of job performance. Changes in training technology are slowly destroying
technical training performance as the criterion which historically has
been used in the validation of selection and classification tests. This
situation, of course, is decidedly inconvenient, but one healthy effect
of it is that we are being forced to take a closer look at the possibility
of developing a criterion more directly related to on-the-job performance,
an effort which should continue across the years in any organization with
a practical interest in predictor research.

We have high hopes, but few illusions. We know that the criterion
problem has been perhaps the most intractable one in psychometrics since
its inception. But we know also that, for some incomprehensible reason,
few concerted and sustained efforts have been mounted on this most
important research area. We do not expect to "solve" the criterion
problem, but we hope we can make a few contributions, and we believe we
can at least make some progress toward our modest goal: to develop a
satisfactory substitute for technical school grades to use as a validation
criterion for our predictor tests.


This symposium was sponsored by AFOSR, with the invaluable assistance
of Captain Jack Thorpe. The purpose was to bring together several of the
researchers who have been recently concerned with various aspects of
criterion research to exchange ideas over a 2-day period, and to provide
discussion and critique of the directions our respective research efforts
are taking. More formal presentations of work and ideas connected with
criterion research by military scientists comprised the central part of
the 2-day period. It was preceded by more informal material in the way
of introductory remarks, and it was followed by summary material provided
by a panel of five eminent researchers from the civilian community who
were invited to serve as expert consultants and to give us their views
concerning our work. The informal materials preceding and following the
formal presentations were taken directly from tape recordings of the
proceedings and, with minor editorial changes by the speakers (who were
invited to review their remarks prior to publication), appear just as they
were spoken.

We sincerely hope that the publication of these proceedings will be
representative of the most advanced thinking currently available on
criterion research. We confidently believe that this publication contains
thinking which will be helpful to anyone directly concerned with this
challenging and fascinating area.

OPENING STATEMENT



Dr. Charles E. Hutchinson 
Air Force Office of Scientific Research 

I have a memory for all of the wrong things. I can remember one
time spending 10 weeks in San Antonio, and the reason for being here
was to deactivate the Air Force Personnel and Training Research Center.
Some of you may have memories that long. My role was to cull through
the productive efforts of a lot of people, both in-house and by
contractual support, in the area of social psychology and social sciences,
which was supposedly my field, and recommend which should go to the
archives, which should go to the burn basket, and which to try to
salvage.

And I can bet you that this is a much happier time to be in San
Antonio, to not bury Caesar but to praise him, and it's been one of the
delights of my short career in OSR (I've only been there since 1956,
the same year that I deactivated AFPTRC), and I got hooked by OSR and it
became an addiction.

But the reason for OSR being involved is that OSR is a research
arm of the Air Force which reaches out to the research community in
universities. For your information, I think in the year to come, 1978,
and the years following on, there will be an expanded Air Force research
program in universities, and AFOSR will be the key instrument for the
Air Force in reaching the universities under this program. I simply tell
you that to alert you. Many of you are in service, some of you may by
that time be out, but don't forget OSR. It's a place that will be
available. The new research program is being sponsored by the Department
of Defense. I can tell you what the planning was when I was a part of
the system, and it was that the first year would be 33 million dollars,
11 million to each of the services for expanded university defense
research; the second year would be 50 million with whatever proportion
would go equally to the services; and the third year a 75 million dollar
program, 25 million in each of the services.

Now if this program comes to OSR (and they're still talking about
it; Dr. Allein and Dr. Gopc^a are still in place), we're going to need
some help in encouraging people to do meaningful research that has
justification for the Air Force, not for the National Science Foundation,
not for the National Institute of Health, and it's OSR's role to manage
a program of this kind, which includes university research and other
research organizations working for the Air Force, to assure that this
is coupled with the needs, both current and future, of Air Force
laboratories. The prime laboratory that I have been concerned with, and
for which I'm most grateful because they have made it easy to do my
coupling job, is the Human Resources Laboratory through its divisions.
It is another evidence of that coupling that I'm here today and that OSR
can have a small part in fostering a program that invented the concept of
having a meeting. The work was done here in the Personnel Division,
and I'm able to take all this credit simply because there was a concept
in OSR to expend some resources in trying to improve the coupling, and
OSR's been at that point.

I'd like to make one introduction. I'm here talking for OSR as if
I belonged. It's correct that I am a retired person and not a program
manager anymore; I'm almost a free citizen. I've got under two weeks,
I think, to finish this year's quota that they've allotted me. But
Capt Jack Thorpe is the official and substantial representative of
AFOSR (you may have known him as a substantial member of the Flying
Training Division program), and he will be with us and he is the program
manager in the area in which this meeting operates. So if you have
ideas and you want to sell somebody, don't tell me, tell him. Jack
will be fomenting this program to the best of his abilities, and we are
convinced in OSR that they're substantial. I really, as I said, have
nothing to say other than welcome and get with it.




WELCOMING REMARKS

Colonel Dan D. Fulgham
Commander
Air Force Human Resources Laboratory



It's a great pleasure for our laboratory to host this meeting. I
came down here with some intention of making a few opening remarks and
reminding you of the importance of this kind of work, but seeing the
people in the audience (I think I probably know 90% of you), and since
this isn't Sunday, there's no sense in me preaching to the choir today. I
would like to welcome you and tell you I believe that, as psychologists,
you're in very good hands. Ty Newton's a physiologist; Dr. McCormick
will tell you that I'm more physiologist than psychologist, so we think
we can probably do you a good turn. But we are very pleased to have
you here.

Charley made some remarks in connection with the demise of personnel
research except for the small unit that we had left at Lackland. When I
came into the organization back in 1971, I started asking questions about
why the work that apparently was so important to the Air Force should
have fallen into enough disfavor of support that we actually wound up
losing a considerable organizational capability. I think, Charley, if
I'm correct, you went from about twelve hundred people down to 800 and
finally wound up with about 250 left at Lackland when they disestablished
the organization. And I think that probably one of the major reasons
that led to the lack of support at the higher management levels of the
organization was that the research efforts got too far from the user
requirements. It seemed that when it was time for the user to stand
up and be counted and support the laboratory, he couldn't find enough
usable research that was being directly applied to some of his problems.
I think that probably one of the things that we have to guard against
in this business more than anything else is the production of useful
but not used research.



Now we've taken a new tack in this laboratory in that we try to
ensure that when we start working on a user problem, he is convinced it's
a problem, that we share that conviction, and we try and draw him into
our research with us. And I think that that has paid off enormously
for us in that we're getting a better pickup on our product than ever
before. Now, since I'm principally experienced in the flying end of
the business, we, of course, have been very, very much interested in
research over time on the performance of the pilots and aircrews. I
was reminded by a colleague from the University of Michigan recently
that we've been working on objective performance measurement for 30
years in flight regimes and we're no closer to having a viable system
than we were when we started. So, something that I think you'll be
hearing about today (hopefully you'll mention it) is the pilot skills
maintenance program that we're trying to generate. We're trying to
draw a lot of this human performance under an umbrella program that
we're going to call Skills Maintenance and Reacquisition Training.
Now a key element of this, step number 2 after the identification of
the skills in which we're principally interested, is the measurement
of performance in those skills. And hopefully, for the first time
(and we have some indication we may be successful this time), we're
going to convince the Air Force to let us scientifically or technically
manipulate these skills and their performance and measure the effects.
From this, hopefully, will come the data base that we need. Then we
need to determine what kinds of training programs, what combinations
of media, and what kind of a training system we need in the aircrew
area. I think there'll be a great deal of spin-off from this into the
other areas of performance measurement as well.

INTRODUCTION TO KEYNOTE SPEAKER



Colonel Tyree H. Newton
Chief, Personnel Research Division
Air Force Human Resources Laboratory



I mentioned earlier that in order to get something like this off
the ground it takes a lot of people doing a lot of things. The prime
mover for this symposium was Dr. Leland Brokaw. It was his idea. He
discussed it over a year ago and it kind of faded for a while, and then
he brought it up again, and he kept with it. He's the one who made
the contact with Dr. Hutchinson, he provided the theme and the format
for this symposium, and it's through his persistence that we're here
today. Dr. Brokaw has been with this organization, or the precursor
of this organization, since 1946 as a civilian. Prior to that time
he was with it for 3 years in the military, so he knows the business.
He's held virtually every type of job in personnel research, and he's
presently the Technical Director for the Personnel Research Division.
It's with pleasure that I introduce to you Dr. Leland Brokaw, who will
give the keynote remarks for this symposium.




KEYNOTE ADDRESS 



Dr. Leland D. Brokaw 
Technical Director
Personnel Research Division 
Air Force Human Resources Laboratory 

Col Fulgham warned about preaching to the choir and I find myself
in that somewhat unenviable position, but it seemed to me that a few
comments to perhaps set the tone for this meeting would be in order.
I realize a keynote speech is supposed to arouse your passions and your
enthusiasms, and we all go forward to defeat the foe and all those good
things, so this really isn't a keynote; this perhaps is more of a
footnote. In passing, I'd like to point out that numbers of us have heard
an announcement proffered by my friend, Fred Muckler, who is back there
in the bleachers someplace. The Navy is having a similar kind of
meeting focused on their problems in performance measurement, October
12 through October 14, in San Diego, and I look forward to being there.
It is our hope that some of the things that are perhaps conceived here
will be born there.

We are met to discuss a basic problem in personnel management. We
are met to discuss an intractable difficulty in personnel research. We
are met to discuss an area in which there has been scientific frustration
and lack of confidence for many, many years. Yet in a pragmatic
world of work we see businesses, industries, and military services
going about their missions in productive ways with apparent happiness
on the part of the people who work for them. So why then are we making
such a big deal of developing ways of objectively measuring performance
on a job? Is it because we lack the ingenuity, is it because we do not
perceive the true complexity of work environments, or is it because we
are making the job too complicated for ourselves? Col Fulgham supported
us in October of 1976 when we launched a program in criterion
development. He knows that we know that the probability of our finding a
glorious solution is relatively small. He knows, as we know, that if
we do find such a solution, it will be to the considerable benefit of
most industries, most industrial psychologists, most organizations.

Our goal is to develop a methodology for the collection of job
performance data for use in the validation of Air Force selection and
classification devices. It's parochial, it's narrow, and it's our
problem; it's the one we want to talk about here today.

There are three reasons we want to do this. First, changes in
training technology are slowly destroying technical training performance
as our criterion to be used in the validation of selection and
classification tests. If we look at pass/fail we find that the PQ splits
are 90 to 10 or worse. Air Training Command has recognized our problem.
They are continuing to develop a continuous numeric score for many of the
courses at some cost to themselves.
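
The cost of a lopsided pass/fail split can be illustrated numerically.
The sketch below is not part of the proceedings. It assumes that
underlying training performance is normally distributed and that
pass/fail is a simple cut on it. Under that assumption the variance of
the pass/fail criterion is p times q, and a validity coefficient computed
against pass/fail (a point-biserial correlation) equals the correlation
with the underlying continuous score multiplied by the shrinkage factor
computed here. The function names are illustrative only.

```python
from statistics import NormalDist


def dichotomy_variance(p):
    """Variance of a pass/fail criterion when a proportion p passes."""
    return p * (1.0 - p)


def dichotomization_factor(p):
    """Ratio of point-biserial to biserial correlation.

    Assumes (hypothetically) a normally distributed underlying criterion
    that is cut so that a proportion p of trainees pass.
    """
    unit_normal = NormalDist()
    cut = unit_normal.inv_cdf(p)  # z-score of the pass/fail cut
    return unit_normal.pdf(cut) / (p * (1.0 - p)) ** 0.5


if __name__ == "__main__":
    for p in (0.50, 0.70, 0.90, 0.95):
        print(
            f"pass rate {p:.2f}: "
            f"criterion variance {dichotomy_variance(p):.4f}, "
            f"validity shrinkage factor {dichotomization_factor(p):.3f}"
        )
```

Running the loop shows the criterion variance and the shrinkage factor
both falling as the split moves away from 50/50, which is one way to see
why a continuous numeric course score is the more useful criterion.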

Secondly, we have recognized ever since I started in this business,
longer ago than most of you have been here, that the technical training
grade as a device for the validation of a selection instrument is
an interim kind of criterion. The objective of selection, like the
objective of training, is to put a competent worker in a job. While
it is true the completion of training is a hurdle that you must get by
to get to the job, there is as yet very little demonstration of
relevance of the selection or the training for the job. We must
generate a system that will permit the judgment of such relevance.

The third reason was forecast in my opening comments. A research
problem exists here. Ad hoc developments for the purpose appear in the
literature by the thousands, but there does not appear to be a
continuity, a flow, which establishes systems which can be applied
objectively by comparatively untrained people and which will generate
useful data for our purposes. Assessment centers for the identification
of managers or the pinpointing of places where managers need training are
very popular these days. We thought about assessment centers for perhaps
45 seconds and concluded that the ponderous nature of the time that
they take and the amount of money that they cost render them undesirable
as useful measures for the validation of enlisted selection
measures in the Air Force. An eminent psychologist, whose name I can't
remember, has contemplated this problem and he has said, "It's going
to cost you a lot of money to collect performance data to use for a
criterion. But be that as it may, if that's what it costs, go ahead
and spend it." Well, these are nice, brave words for a guy who doesn't
have my budget.

In our own program, our approach has been classical. I'm afraid
we've shown very little ingenuity. We're starting from all the well
known places. But it is our intent by doing this to tie together the
shreds we find in the literature and to build a basis for further
progress. We've always got an eye on the checkbook. It is our intent
to balance costs to get results. If we are completely successful, we'll
have a straightforward, inexpensive, objective way of collecting the
kind of data that we need.

Now you all know that there are performance measuring systems
operational in every organization for every kind of people in these
organizations. But there are differences between those kinds of data
and the kind that we need for the validation of classification devices.
We need devices that are sensitive to individual differences in job
specific skills. If it's possible, we need to measure these skills in
a way that is uncontaminated by the personality and the motivations of
the incumbent. At the same time we need also to measure that motivation,
the drive, the initiative, so that we can moderate, if you will, the
aptitude data that we collect. The performance evaluations used in
operational programs tend to be more generalized; they tend to be
overall measures of productivity or performance. They tend to be focused
on promotability rather than on the things which make the current job
really well done or not well done. And we have another problem.
Insofar as a supervisor cannot hire or fire or promote unilaterally,
insofar as a supervisor is not culpable for high ratings, insofar as a
supervisor depends upon his people for his own production, there will
be a tendency for him to rate high. When ratings get high they lose
their variance, and when they lose their variance they lose their
predictive efficiency. We find this in most military performance
programs.
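
The link between inflated ratings, lost variance, and lost predictive
efficiency can be made concrete with a small simulation. The sketch
below is not from the proceedings; the 1-to-9 rating scale, the
"leniency" shift, and the true validity of 0.5 are all assumptions chosen
for illustration. It generates aptitude scores and job performance, has a
simulated supervisor rate each person, and compares an unbiased rater
with one who pushes most ratings against the top of the scale.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def variance(x):
    """Population variance of a list of numbers."""
    m = sum(x) / len(x)
    return sum((a - m) ** 2 for a in x) / len(x)


def rate(performance, leniency, top=9):
    """Hypothetical supervisor rating on a 1..top scale.

    performance is a standard score; leniency shifts every rating upward
    before it is rounded and clipped to the scale.
    """
    raw = 5 + 2 * performance + leniency
    return max(1, min(top, round(raw)))


def simulate(leniency, n=5000, validity=0.5, seed=1):
    """Return (rating variance, aptitude-rating correlation)."""
    rng = random.Random(seed)
    aptitude, ratings = [], []
    for _ in range(n):
        a = rng.gauss(0, 1)
        # Performance correlates with aptitude at the assumed validity.
        perf = validity * a + (1 - validity ** 2) ** 0.5 * rng.gauss(0, 1)
        aptitude.append(a)
        ratings.append(rate(perf, leniency))
    return variance(ratings), pearson(aptitude, ratings)


if __name__ == "__main__":
    for label, leniency in (("unbiased rater", 0.0), ("lenient rater", 6.0)):
        var, r = simulate(leniency)
        print(f"{label}: rating variance {var:.2f}, validity {r:.2f}")
```

Because both runs use the same seed, the people and their true
performance are identical in the two conditions; only the rating
behavior differs, so any drop in the observed validity comes from the
ceiling on the ratings.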

This conference has three major objectives. First, to share our
areas of concern and difficulty, that we may jointly explore for
economic solutions. Secondly, to review ongoing efforts in the
Personnel Research Division for the elicitation of constructive
criticism. Thirdly, to foster common attacks upon our common problems, the
best approach to this business. With the experience and the expertise
provided in this group, we'll have a better chance than we've ever had
before to really begin to cope with some of the basic issues of this
matter. Let us move into the presentations of this symposium with an
awareness of the difficulties of the area, with confidence that there
are ways to solve them. Let us be critical in our search for effective
techniques, and let us be alert for the positive things in every
presentation that we'll hear.



AIR TRAINING COMMAND INTEREST IN THE CRITERION PROBLEM

Donald E. Meyer
Air Training Command
Randolph Air Force Base, Texas

The main theme of this symposium has to do with performance
criteria as they apply to personnel selection and classification, and
you may be assured that the Air Training Command has vital and continuing
interests in these areas. But after the selection and classification
process is completed, the Air Training Command is faced with providing
the most effective and economical training possible. Consequently, in
recognition of our extended interests, Dr. Brokaw gave me permission to
change the thrust of my presentation to the need for performance
criteria for training purposes.

As many of you know, the Air Force has been committed to the use
of instructional system development (ISD) since about 1970, first by
policy statements from the Air Force Chief of Staff, and more recently
by Air Force regulation. Additionally, conceptual guidance is given in
Air Force Manual 50-2, and "how to" information for application of ISD
to course development is provided by Air Force Pamphlet 50-58. An ISD'ed
course is based on the exact requirements of the specialty for which
the training is provided. It is a key to the avoidance of unnecessary
and therefore wasteful training. Avoidance of waste has always been
important to skillful and conscientious course developers, but now
becomes a necessity due to budgetary restraints.

Although the Air Training Command led the Air Force in the use of
ISD in course development, we are still beset with many problems. Better
training for ISD practitioners is a continuing need. Additionally,
ISD training for management personnel needs to be further emphasized
to make management more aware of the time, effort, and resources that
must be invested in a really first-class ISD treatment and, of course,
to bring a realization of the efficiencies that result, i.e., teaching
precisely what is needed for the job. These are real problems, but
solutions come readily to mind and there is hope that if not by edict,
perhaps through osmosis, they will be solved over time.

The biggest problem, and the one for which I can see no near term
solution, lies in the early phases of applying the ISD process, the task
analysis. In addition to being the first step in the ISD process, it
is also the most crucial, for without the proper data base, expressed
in usable detail, the effort rests on a bad foundation. The result,
though perfectly executed, will likely fall short of providing the most
cost-effective training possible, i.e., it may teach either more or less
than the skills required on the job. The likelihood is that the course
will contain more than required, and that is wasteful. Non-ISD believers
scoff at this idea by asserting that no one can ever know too much. I
agree with them in principle, but the notion assumes that having once
been exposed to a skill or subject matter in a school situation, it is
retained for application at some later time. This premise seldom holds
true. Again, what is needed is an accurate and reliable means to
identify the performance requirements of the job. In theory we know how
to do this, but in practice some elements are missing. We do not have
access to task analyses for most of the skills we train. And with an
obligation to conduct some 3,000 different courses, of which about
one-third are revised each year, it is doubtful that we will ever have
conventional task analyses for this purpose. Our budget simply won't
accommodate this expense. Let me explain how we presently do business,
what the constraints are, and what needs to be improved.

One of the prime documents used in course development is the
specialty training standard (STS). This is an Air Force publication
used to standardize and control the subject matter content and level of
training perceived as needed to achieve the skills and knowledge required
for an Air Force specialty. It is prepared by the particular ATC
school responsible for the training and then circulated through the
major Air Force commands for review and coordination, after which it
is published to become a quasi-contract between ATC as the producer
and the MAJCOMs who receive our graduates.

The STS is a widely used document. It has been around for about
25 years or so and has wide acceptance in the Air Force. It provides
a listing of the knowledges and skills that should be possessed for
an Air Force specialty and, as such, it provides a start point in the
development cycle. The STS is used as a basis for resident course
development, OJT, follow-on career development courses, and other
functions such as development of the specialty knowledge tests, which are
used for promotion considerations. It is a useful document, but it
does have several limitations that should be given a great deal of
attention.

The first and most obvious is the fact that the STS is developed
by subject matter specialists who rely on their own backgrounds and
experience to determine what it should contain. I can't knock experience
(it's a valuable asset), but frequently people with similar
experience backgrounds have entirely different views on the same topic.
Also, even though the people who develop the STSs bear the same AFSC,
some of them have had different experiences during their careers and
this also leads to disagreements. Who is right? The outcome is
usually arbitrary, but predictably represents the views of the highest
ranking, most articulate, or vociferous member of the team developing
the STS. Errors made are generally on the conservative side, and that's
why the MAJCOMs don't take issue with an STS during coordination. The
training is seen as adequate even though it might be of wider scope
and depth than would actually be required. We have had a lot of help
on this particular problem, based upon AFHRL research in improving the
efficiency of our occupational survey techniques. I'd like to briefly
summarize some things that are happening that are encouraging to the
belief that the STS can be made more objective than it now is.
Periodically, the Occupational Measurement Center, an ATC organization,
conducts occupational surveys. All of the enlisted AFSCs in the Air
Force with authorizations of over 100 personnel in an occupational
specialty are surveyed. This occurs at about 3- to 4-year intervals.
An exhaustive listing of duties and tasks for a particular specialty
is developed by a group of senior and knowledgeable personnel in each
specialty gathered from MAJCOMs Air Force-wide. The listing is then
put into a survey format and sent to the field where performance data
are gathered. Prior to the AFHRL research in this area, occupational
survey reports resulted in voluminous machine printouts and addressed
only the number of airmen performing the tasks and the percent of time
they spent on them. Though they provided reliable data, these
printouts proved tedious to analyze and incomplete for use in curriculum
development. Course designers still had to base their decisions on many
undefined subjective factors such as "task criticality," "task
importance," etc.

The recently developed product of HRL research promises to virtually
automate the decision making process. The research has identified and
quantified the major factors of the previously subjective judgments.
These new factors, task delay tolerance, consequences of inadequate
performance, and task difficulty, can be statistically combined with
the old factors to yield a training priority index. This index ranks
each task in a specialty in the order of its priority for training.
From these data, a fairly objective picture of what people in the field
are actually doing and the implications for training can be obtained.
The Command has recently developed a procedure that uses the
occupational survey data to construct specialty training standards. At
present, the procedure is being service tested at several of our
technical training centers. If the present service test proves the
technique successful, a big obstacle, that is, the subjectivity of the
STS, will have been overcome. This will give us a certain amount of
assurance that the STS is based upon actual field requirements rather
than what someone thinks those requirements are.
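
One way such an index might be assembled is sketched below. The talk
names the factors but does not give the formula, so everything numeric
here is an assumption: the task names, the factor values, the equal
weights, and the choice to standardize each factor and form a weighted
sum are all hypothetical. The only point illustrated is the general idea
of statistically combining the old factors (percent of airmen performing,
percent of time spent) with the new ones (delay tolerance, consequences
of inadequate performance, difficulty) into a single ranking.

```python
from statistics import mean, pstdev

# Hypothetical weights. Delay tolerance is weighted negatively because a
# task that can wait is assumed to need less formal training priority.
WEIGHTS = {
    "pct_performing": 1.0,
    "pct_time": 1.0,
    "delay_tolerance": -1.0,
    "consequences": 1.0,
    "difficulty": 1.0,
}

# Hypothetical survey results for three tasks in one specialty.
TASKS = {
    "Align receiver": {
        "pct_performing": 80, "pct_time": 12, "delay_tolerance": 2,
        "consequences": 8, "difficulty": 7,
    },
    "Replace fuses": {
        "pct_performing": 60, "pct_time": 5, "delay_tolerance": 5,
        "consequences": 4, "difficulty": 3,
    },
    "Inventory spare parts": {
        "pct_performing": 30, "pct_time": 3, "delay_tolerance": 8,
        "consequences": 2, "difficulty": 2,
    },
}


def standardize(values):
    """Convert raw factor values to z-scores across tasks."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def training_priority(tasks, weights):
    """Return (task, index) pairs sorted from highest to lowest priority."""
    names = list(tasks)
    index = {name: 0.0 for name in names}
    for factor, weight in weights.items():
        z = standardize([tasks[name][factor] for name in names])
        for name, score in zip(names, z):
            index[name] += weight * score
    return sorted(index.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for rank, (name, value) in enumerate(training_priority(TASKS, WEIGHTS), 1):
        print(f"{rank}. {name}: index {value:+.2f}")
```

In practice the weights would presumably be derived empirically rather
than set equal, but the ranking step would look much the same.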

Even with this improvement, however, the STS task statements are too
broadly stated to be used in the development of behavioral objectives
for efficient training. For example, in one of the electronics career
field STSs, a task statement says "Align the system." This is an
important maintenance function and it is simply and understandably
stated. Upon a closer look, however, we find that there are some 50
alignments that can be made on a given piece of equipment. You can
really see the dilemma faced in trying to apply ISD with that kind of
imprecise data base. The STS task segments are just not specific enough.
The course developer is forced to exercise subjective judgments that can
be very wasteful in terms of over-training or dangerous in terms of
under-training.



What we need is a method that will translate the task statements
of the STS into task analysis-type detail usable for course development.
The process must be reliable, fast, and economical. I have seen a
classification of nine different approaches to task analysis. This
classification ranges all the way from on-site observation to a single
subject matter expert making a detailed break-out of task data. Each of
these approaches has its advantages and disadvantages. The most
reliable approach, i.e., on-site observation by a skilled analyst, is
prohibitively expensive; the least expensive approach, the subject
matter expert, is too prone to personal bias to be creditable. The
solution we seek must exist someplace between these extremes at a
point where we could sacrifice an acceptable percent of reliability for
a great enough reduction in cost to make the process affordable.
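
The trade-off described here can be written down as a simple selection
rule. The sketch below is illustrative only: the talk names just the two
extremes of the classification, so the intermediate approaches, every
reliability figure, and every cost figure are hypothetical placeholders.
The rule picks the cheapest approach that still meets a stated
reliability floor and fits a stated budget.

```python
from typing import NamedTuple, Optional


class Approach(NamedTuple):
    name: str
    reliability: float  # hypothetical, 0 to 1
    cost: float         # hypothetical cost units per task analyzed


# Placeholder figures; only the two end points are named in the talk.
APPROACHES = [
    Approach("on-site observation by skilled analyst", 0.95, 100.0),
    Approach("intermediate approach A (hypothetical)", 0.88, 40.0),
    Approach("intermediate approach B (hypothetical)", 0.82, 20.0),
    Approach("single subject matter expert", 0.60, 5.0),
]


def cheapest_acceptable(approaches, min_reliability, budget) -> Optional[Approach]:
    """Cheapest approach meeting the reliability floor within budget.

    Returns None when no approach satisfies both constraints.
    """
    candidates = [
        a for a in approaches
        if a.reliability >= min_reliability and a.cost <= budget
    ]
    return min(candidates, key=lambda a: a.cost) if candidates else None


if __name__ == "__main__":
    choice = cheapest_acceptable(APPROACHES, min_reliability=0.80, budget=50.0)
    print(choice.name if choice else "no affordable approach is reliable enough")
```

The hard research question is, of course, obtaining trustworthy
reliability and cost figures for each approach; the selection itself is
trivial once those exist.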




We need the help of the research community in the development and
validation of a technique or techniques to solve this problem. The
training establishments of the services would be the most immediate
beneficiary, but there are other applications as well: the production
of job performance aids, the production of maintenance instructions for
technical orders and perhaps, since the task analysis data we need for
training is closely related to the performance data needed for the
development of improved selection and assignment techniques, it might be
possible for a contribution in this area. I would urge that you keep
this in mind as you shape your research programs. The refinement of
present task analysis techniques or a breakthrough in finding a new
approach that would result in economical and reliable task data in
sufficient detail to be used in course development is sorely needed
and will require at least as great a research effort as was expended
in the improvement of the STS.

THE CRITERION PROBLEM: A PERSONNEL
MANAGEMENT PERSPECTIVE



Major Wayne S. Sellman and Lt Col Willibord T. Silva 
Air Force Military Personnel Center 
Randolph Air Force Base, Texas 

Within the Air Force, we are confronted with the same personnel
problems as any other organization, whether large or small, public or
private: that of shaping and adapting available human resources into
useful and effective manpower. In that regard, the very multiplicity
of skills required by the Air Force poses problems in personnel
planning, training, and manpower utilization which are all but
unprecedented. Personnel requirements change rapidly and on a large scale,
and are dependent to a large extent upon technological advances and the
international political situation.

Obviously, Air Force personnel management is a highly complex
affair. As you know, to cope with these complexities requires creative
and innovative personnel research, research which addresses all aspects
of the personnel life-cycle: selection, classification, training,
performance appraisal, promotion, and organizational development. Such
topics are of great interest to us, an interest engendered from two
basic sources. First, we are users of your product. Our effectiveness
as personnel managers hinges on the successful application of techniques
and procedures developed from past personnel research.

Second, we are sponsors of your research. In that role, we serve
as the liaison agency between you and the rest of the Air Force,
encouraging, explaining, and extolling the virtues of research and its
applications.

Thus, we have a very symbiotic relationship with personnel research
scientists. We depend on you for timely and efficient solutions to
management problems as well as for input into the formulation of personnel
policy. You, in turn, depend on us as sort of public relations experts
who ensure your various efforts are understood and appreciated not only
across the Air Force rank and file but at the highest echelons of Air
Force management as well. So, we were especially pleased to accept
the invitation to speak at this symposium and share some of our ideas
and perceptions with you.

Now, to the subject at hand. We were asked to comment on the Air
Staff interest in the criterion problem. That interest can be expressed
in one word: considerable; in fact, to overstate its importance to
personnel management would be literally impossible. How we do business
in personnel is to a large extent determined by the criteria used in
personnel research. Without adequate criteria, personnel functions
derived from and dependent upon that research would be less effective
and efficient. In other words, the magnitude of the contribution of
personnel research to Air Force personnel management is determined,
for the most part, by the adequacy of the criterion measures evolved.

Having now established our interest in the criterion problem,
perhaps it would be appropriate for us to identify just what we mean
by a criterion. Blum and Naylor (1968) define criterion as a "measure
of the goodness of a worker." Don't we wish this were so in the Air
Force? In industrial personnel research, the criterion that is usually
used concerns the degree to which a worker can be considered successful
on the job. For example, the criterion might be sales figures, number
of acceptable units produced, or any other measurement of work
accomplishment, or lack thereof. Unfortunately, in the Air Force we have
no overall measure of job success or productivity although one has been
sought for the last 35 years.

Other definitions of the criterion may also be found in the
literature. Guion (1965) defines it simply as "that which is to be
predicted," while McCormick and Tiffin (1974) have described it in terms
of "a dependent variable." It would seem that the Air Force rather
pragmatically subscribes to these latter two definitions. In practice,
our primary criterion is success in training; its rationale is that if a
person is adequately trained, he will have sufficient knowledge to be
able to successfully perform his job.

Although much work on the criterion problem has been accomplished,
especially in measuring success in training, perhaps the time has come
to shift emphasis and explore other types of criteria, criteria such as
attitudes, motivation, satisfaction, leadership, accidents, absenteeism,
and rates of promotion. Take the latter two, for example. All other
things being equal (and they almost never are) the employee who attends
work regularly is more valuable to the organization than the one who
frequently misses work. If patterns of absence could be reliably
measured, they might serve to open a new dimension in military selection
research.

Moreover, even though the Air Force uses a weighted factor promotion
system for enlisted personnel, length of time before promotion occurs,
or number of times considered before promotion selection might be
measures of promotability that could be used. Admittedly, because of
constraints unique to the Air Force, such criteria may not be as easily
measured and possibly not as directly relevant as if they were industrial
criteria. Nevertheless, perhaps more attention should be directed
toward their possible use. And, of course, there is still our old
friend, job productivity. Even though past efforts haven't exactly
yielded a breakthrough, pursuits in this direction must be continued.

Recently, selection research in the military services has been
criticized by the Defense Science Board as well as other committees and
working groups chartered by the Office of the Director, Defense Research
and Engineering, for apparent lack of progress. These groups point out
that validities are no higher today, on the average, than they were a
decade ago. It is commonly accepted, although not necessarily by testing
researchers, that the reason for this situation lies in the types of tests
that are used as predictors (i.e., we have reached the state-of-the-art).
However, another equally likely explanation may be in the way in which
the criterion problem has been handled. Psychologists have traditionally
sought "the criterion." To do that we have attempted to combine several
subcriteria into one overall measure of job performance. But, as we
have become more sophisticated, we have moved toward a position that
job success is multidimensional in nature. If this is so, then it
would follow logically that criteria must also be multidimensional.
Could it be that one way to enhance our selection and classification
strategies would be through the use of multiple criteria? Too often,
we do not use all the job information available in the selection of
criteria. True, time and cost considerations come into play, but more
effort should be expended in selecting criteria appropriate for each
individual military occupation, not just using success in training as
the catchall criterion for all of them.


In this regard, we believe that one of the best statements of this
point was made by Wallace and Weitz in the 1955 Annual Review of
Psychology: "The criterion problem continues to lead all others in lip
service and to trail most in terms of work reported. It seems probable
that almost all investigators now recognize the importance of developing
acceptable criteria and submitting them to the greatest scrutiny
and correction. Unfortunately, a reviewer must also conclude that the
pressure of getting things done is still wooing many into the convenient
device of accepting the criteria at hand and hoping it will turn out
all right." Unfortunately, this situation is even today, some 20 years
later, still the rule rather than the exception.

Now one final word about the selection of criteria. Brogden and
Taylor (1950) have identified ten major criterion problems encountered
by personnel researchers. One of these is sponsor acceptability: the
selection of a criterion that is meaningful and fully acceptable to
management. We would suggest that today's researchers, particularly
those in the military environment, are not as sensitive to this
consideration as they could and should be. For example, in planning
studies, how often do scientists interact with research users in the
selection of criteria? Probably not very often. A more common occurrence
might be the scientist selecting the criteria and then informing the
user, if even that much coordination goes on in the research planning
stages. Clearly, here is an area where research can be made more user
oriented: the user must be involved in the selection of "acceptable
and relevant" criteria.

The issue of relevance introduces an area of criterion technology
alluded to earlier, i.e., operational/mission effectiveness. Using
the best criteria available, we have selected, classified, and trained
a highly capable personnel force and sent them to the field with
assurances to commanders that these people can do the job. What now?
How does the commander know that the job is being done, or, even more
importantly, that the mission will be accomplished when or if the horn
blows? Every commander is seeking that evasive assessment of
organizational effectiveness which represents the operationalization of
the skills and capabilities of his personnel.


Typically, we in the military have assessed overall mission
effectiveness in terms of the four factors shown in Figure 1. For the
combat unit all of these are relevant; for support units different
combinations of the four factors are more appropriate. However,
regardless of the unit's mission or function one factor remains
constant: personnel.


We make our evaluations of the non-personnel factors in fairly
quantitative terms using computer modeling, engineering tests, combat
experience, and on-site inspections. Our assessment of the human
factor is much less sophisticated. War games or exercises and
operational inspections are our typical tools, but these are subjective at
best as well as time constrained. When we consider that in a year's
time 20% of a unit's personnel may have changed, the effectiveness
rating received 12 months earlier takes on an entirely different
perspective. Thus, the requirement for quantifiable, integrated,
time-sensitive criteria for organizational effectiveness remains a
technology need.

The literature on organization is extensive and, because of its
ubiquitousness, has made application difficult and somewhat limited.
While organizational criteria have been described in terms of system
input/output/process variables, identification of potential standards
alone is not enough. Such identification must be followed with the
development and validation of reliable and relevant criteria of
organizational effectiveness. Bowser, in an August 1976 review
concerning criteria of operational unit effectiveness, summarizes the
requirement quite succinctly: "The basic problem of defining
organizational effectiveness within the U.S. Navy (all Services)
requires considerable research. The framework established for
evaluation of criteria is general enough to fit most organizational
criteria. However, because it is so general, it may not provide
sufficient structure for evaluation. The state-of-the-art concerned with
evaluating organizational effectiveness is primitive enough to require
development of criteria in order to support organizational research."

Our latter excursion into organizational effectiveness was obviously
not intended to provide a learned treatise on operational criteria
technology. It was, rather, designed to sensitize you to a legitimate
user need. We must not forget that the personnel pipeline extends far
beyond its input junction. Indeed, perhaps its reach beyond that point
represents the most challenging albeit most rewarding advancement of
criterion technology.

In summary, Air Staff interest in criteria is to find the best
one(s), combine them in the most appropriate and imaginative way, and
accordingly streamline to the maximum extent the way we do business
in "hiring, placing, progressing, and evaluating" our people. However,
as Blum and Naylor (1968) have pointed out, "For years, psychologists
have labored under the notion that the objective is to find 'the
criterion' in the same way that the knights of King Arthur's Round
Table were charged with finding the Holy Grail. Both have had about
equal and limited success." We trust that in the ensuing 9 years this
situation has somewhat improved. Certainly Patricia Cain Smith (1976)
in her chapter on criteria in the Handbook of Industrial and Organizational
Psychology sounds a note of optimism. In any event, development of
reliable, relevant, and valid criteria for use in Air Force personnel
research (and management) remains a task of paramount importance. It's
nice to be present at this symposium and to know there are the kinds of
people represented here who are capable of addressing this difficult
problem.

ARMY RESEARCH IN THE CRITERION AREA:
A CHANGE OF EMPHASIS



J.E. Uhlaner, A.J. Drucker, and W.B. Camm
U.S. Army Research Institute for the Behavioral and Social Sciences
Alexandria, Virginia 22333


During the past decade, Army research to develop and measure criteria
of human performance has moved to achieve greater relevance to job tasks,
including the noncognitive aspects of these tasks, and more efficient
implementation of performance measures related to Army problems. That
is, criteria are expected not only to be psychometrically predictable
but to show reasonably logical, relevant relationships to the job.
There is wide recognition that few job performances are unidimensional,
and also an awareness that it is neither possible nor feasible to test
completely all the component tasks and subtasks of many jobs or work
situations. [...] developed. Information concerning how well an
individual can perform the tasks necessary to do the job is often
gathered by means of a "criterion referenced test": a test made up of
items directly related to the job of interest (Boycan & Rose, 1977).
Adequate and relevant statistical measurement of job performance is
either not practical or not rigorous, and is often influenced by
noncognitive considerations, e.g., degree of risk taking. New assessment
indicators had to be developed and used along with more conventional
methods. Analytic experience has convinced the performance test
community that there is no easy way to overcome chronic criterion
validity problems. Only meticulous, knowledgeable development of
accurate descriptions of the relationships between psychological
variables and precise identification of these variables can reduce
criterion validity problems. The minimal passing criterion, the way
this criterion was derived from the job objectives, the nature of the
test items, and the length of the test together make up the assessment
system, within which a variety of quantitative models are used
(Macready, Steinheiser, Epstein, & Mirabella, in press).



The Test Bed Model 



For a better understanding of job performance criteria it has
become very clear that a better theoretical base is necessary. The
senior author has presented a concept of the interaction of selection,
training, and job design for effective work performance. His major
hypothesis is that aptitudes, job demands, and surrounding conditions
coalesce to yield varying levels of performance. The conceptual
background for his hypothesis includes a job taxonomy containing cognitive
variance and noncognitive variance, the ad hoc nature of values and
goals, and the great variety of styles of behavior by which individuals
and organizations seek and achieve goals (Uhlaner, 1970).

It is proposed that for many applied purposes, including systems
development, the criterion should be a given one, rather than the yield
of preceding predictors, and should be explicitly specified with respect
to both cognitive and noncognitive variance.

Figure 1 presents a test bed model which can be developed at the
user's location. The user can indicate specifications of the results
he desires. He is provided with a number of negotiable options leading
to the same result, each reflecting a different trade-off possibility.
The user makes the final decision as to the option selected (Uhlaner,

mPh ^ _ . ^_ - ' • \^ - ■ ■ . . ... 

The test bed model method emphasizes the outcomes of decisions and
their consequences for individuals and institutions, whereas traditional
assessments have emphasized only measurement and prediction. The validity
coefficient tells us about the degree of association between the predicted
and obtained criterion scores. But often, from a practical standpoint,
the number of correct personnel decisions resulting from the use of a
given cutoff score is more important than knowledge of the validity
coefficient (Cronbach & Gleser, 1965).
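
The distinction between the validity coefficient and the number of
correct decisions can be made concrete with a small sketch. The scores,
the cutoff values, and the job standard of 50 below are hypothetical and
chosen only for illustration; they are not ARI data, and the function
names are our own.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between predictor and criterion scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correct_decisions(predictor, criterion, cutoff, standard):
    """Count decisions that turn out right: accepted people who meet the
    job standard plus rejected people who would not have met it."""
    hits = 0
    for p, c in zip(predictor, criterion):
        accepted = p >= cutoff
        succeeded = c >= standard
        if accepted == succeeded:
            hits += 1
    return hits


# Hypothetical scores for ten people (illustrative assumption).
test_scores = [35, 42, 48, 50, 55, 58, 61, 67, 72, 80]
job_scores = [40, 38, 52, 47, 60, 49, 66, 63, 75, 78]

r = pearson_r(test_scores, job_scores)
for cutoff in (45, 55, 65):
    n_right = correct_decisions(test_scores, job_scores, cutoff, standard=50)
    print(f"cutoff {cutoff}: {n_right} of {len(test_scores)} correct (r = {r:.2f})")
```

The validity coefficient is the same for every cutoff, while the count
of correct decisions changes with the cutoff chosen, which is the
practical point made above.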



Achievement Criteria 




Army Research Institute for the Behavioral and Social Sciences
(ARI) research results over the decades show that, in general, three
types of criteria are used to measure achievement: school grades,
ratings, and situational or performance measures. The trend, to no
one's surprise, has been away from grades and more subjective ratings,
toward multi-criteria performance-oriented measurements. Table 1 shows
the relative frequency with which these criteria occur in reports of
ARI research over a 20-year period.

Table 1. Type and Frequency of Criteria Used
(N = 209 Publications, 1956 - 1977)

Type of Criteria

I.   Grades             79 (27%)
II.  Ratings            [...] (27%)
III. Performance        93 (31%)
IV.  Multi-Criterion    43 (15%)

Figure 1. Conceptualization of interaction of human factor system
variables as related to performance effectiveness. [Diagram boxes:
Ability Factors (mental factors, skills, etc.); Personality Factors
(values, interests, motivation, etc.); Work & Environment Variables
(equipment, methods, etc.); Organizational Variables (leadership,
incentives, etc.); Special Training (amount & method); Experience
(amount & type).]

Grades are used primarily as criteria for cognitive predictors.
Cognitive factors are those that involve acceptable right and wrong
answers in job elements (Uhlaner, 1970). Grades are used as criteria
for selection and classification tests, much the same as in the past
(Haggerty, 19p3; Maier, 1972; Zeidner, Harper, & Karcher, 1956). The
recently implemented Skill Qualification Testing System (Maier, Young,
& Hirshfeld, 1976) will gradually replace the paper-and-pencil Military
Occupational Specialty (MOS) tests in the Army, however, thus reducing
even further the need for grades. Ratings have been used to evaluate
on-the-job performance of officers and enlisted men, especially where
interaction with other people is involved. Selected performance tests
have been used primarily to measure a more complicated mix of cognitive
and noncognitive job demands.

These three groupings of criteria are not mutually exclusive and are
intended only to provide some indication of the framework of their
use, particularly within the ARI. Note that grades and ratings account
for a little over half (54%) of the criteria used. This is due, in
large part, to the larger proportion of studies involving school
criteria. Also, current trends, as mentioned before, show that training
and other performance criteria are increasingly [...]
or situational performance-oriented indices.

Grades

By far the most frequently used criterion in the period just
following World War II was the academic grade or the pass-fail training
criterion. The relationship between grades and on-the-job performance
has never been very high. Yet where school training is a
prerequisite for job assignment, the trainee must pass the course, and
therefore the applied research scientist must pay some attention to
grades or pass/fail measures in training. School grades appear to
predict best when training is for jobs with high cognitive demands that
involve clear-cut "right" and "wrong" job elements. Validity coefficients
tend to be moderate to high between such jobs and school grades. In
sum, grades are most useful in reflecting ability in academic or
cognitive aspects of the job.

Grades in school do not seem to take into account noncognitive
factors that relate to style of behavior and performance reflecting
specified or implied values and attitudes. Experience on the job seems
to be most crucial for specific noncognitive performance: experience
coupled with the person's use of his/her individual talents and values
to achieve goals.

Ratings

The rating is one measure of effectiveness that seems widely
accepted. The essence of a rating is a judgment by one person or a
group of persons of the performance of another individual. The rating
is simple and familiar, but it is also the source of many fallacious
beliefs among management and supervisors. ARI research for many years
has attempted to establish methods for obtaining reliable and valid
ratings; it has had its impact on many research tasks. However, many
of the fallacies prevailing in the 50's are still with us. Here are
some examples together with research-based information bearing upon
the problem:


Fallacy 1. We can always meaningfully rate a person's performance
on 30 to [?]0 separate scales. Research results have shown that a large
general factor dominates the rating even when deliberate attempts are
made to measure different aspects of job performance by using a number
of specific rating scales. Raters typically seem to perceive only a
single measure of success, whether it is an actual single measure, a
formally weighted composite, or an implicit weighted composite.
However, recent efforts to develop performance criteria have the practical
advantage of combining related fractional criteria into a composite,
tending to avoid the ambiguity of combining unrelated variables. This
procedure defines related performance measures that are more clearly
understood by the evaluators (Duffy, 1976; Root, Epstein, Steinheiser,
Hayes, Wood, Sulzen, Burgess, Mirabella, Erwin, & Johnson, 1976).
Criterion measures that assess individual job performance in terms of
concrete job functions seem to yield a reasonably accurate measure of
performance, whether or not the measures are subsequently combined
into a composite rating. Also, multiple evaluators are likely to
increase the validity of performance ratings.

Fallacy 2. Hard raters render more valid ratings than easy raters.
In research addressing this subject, there is very little difference in
validity of hard and easy ratings, although hard raters tend to bunch
their ratings somewhat lower on the scale (Browning, Campbell, Birnbaum,
Campbell, Falk, & Haggerty, 1952a, 1952b).

Fallacy 3. Bright raters render more valid ratings than the
not-so-bright, or a rater has to be exceptionally bright to rate well.
The research evidence is that raters of average intelligence have
rendered ratings as valid as any rating by others. There is some
evidence that when persons in the lower 16% of the distribution of
mental abilities rate others, the ratings are not quite so valid
(Chesler, Brogden, Brown, & Katz, 1952). However, nearly all raters
tend to evaluate good performance more effectively than poor performance.

Fallacy 4. A better rating can be obtained by giving the rater a
more definite frame of reference. An example of this would be "How
would you like the ratee to serve under you?" rather than "How competent
is the ratee?" The earlier research answer was that if any improvement
resulted, it was negligible (Karcher, Campbell, Falk, & Haggerty, 1952).
However, when measures are behavioral in content and actually relate to
the expected behavior, and the criterion dimensions underlying such
measures are clearly identified, then reliable construct measurement
techniques are effective. The work in this area is still under way,
and problems with the many theoretical aspects of current concepts of
content and construct validity are moot. In any case, raters seem to
rate more reliably and validly when they are aware of the criterion to
be evaluated.


On investigation, thus, these four commonly held concepts have not
proved to be entirely correct. However, several questions are often
asked about rating practices and procedures that affect the research
usefulness of the rating. Typical questions and answers in connection
with the Officer Efficiency Rating are: Should every military officer
be required to show his rating to the rated officer? It should make
very little difference whether the ratings are shown or made by
identified or anonymous raters, provided all ratings are done the same way
(Chesler, Brogden, Brown, & Katz, 1952; Karcher, Win[...], &
Haggerty, 1952; Seeley & King, 1956). Are ratings by identified raters
any different from ratings by anonymous raters? The consensus is that
although there may be an inflation of ratings when the ratings are
shown, differences in validity are negligible. Do raters agree more on
their evaluations of job success if they have had more opportunity to
observe the individual performing on the job? The answer is yes,
generally, as implied in Table 2 (Medland & Olans, 1964).

Table 2 also shows superior validity of peer ratings, which have
proven to be generally reliable and valid, over cadre ratings (Mohr,
1975). One can reason that fellow trainees or fellow workers on the
job are usually in a good position to observe performance, and that
frequent association in a training situation, even for a period of 8
weeks, is sufficient to enable the rater to make the judgments required.
Table 3 shows some of the research evidence for the claim that the
peer rating is one of the best predictors of subsequent Army performance
(Downey, 1976; Drucker, 1957; Parrish & Drucker, 1957; Willemin,
Rosenberg, & White, 1957).



Table 3. Peer Rating Comparisons

Combat            r = .60
Leadership        r = .49
Special Forces    r = .43
West Point        r = .50
Ranger            r = .52



Another important finding in most rating situations is that a
rating based on the judgment of more than one rater is better than a
single rating (Karcher et al., 1952). The use of multiple raters is
quite likely to increase the validity of the performance rating.
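
One way to see why pooled ratings tend to be more valid is the standard
psychometric result for the mean of several parallel measures. The
sketch below is not taken from the studies cited; it assumes that every
rater has the same validity, the same variance, and the same correlation
with every other rater, and the values 0.30 and 0.50 are hypothetical.

```python
from math import sqrt


def composite_validity(single_validity, inter_rater_r, n_raters):
    """Validity of the mean of n_raters ratings.

    Assumes parallel raters: equal variances, one common inter-rater
    correlation, and equal correlation of each rater with the criterion.
    """
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    return single_validity * sqrt(
        n_raters / (1 + (n_raters - 1) * inter_rater_r)
    )


# Hypothetical values: one rater correlates 0.30 with the criterion,
# and any two raters correlate 0.50 with each other.
for k in (1, 2, 4, 8):
    print(k, round(composite_validity(0.30, 0.50, k), 3))
```

Under these assumptions the gain from each added rater shrinks, and the
pooled validity can never exceed the single-rater validity divided by
the square root of the inter-rater correlation.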

[Table 2. Peer and cadre ratings compared at successive weeks of
training; values not legible in the source. From Medland & Olans, 1964.]



However, evaluations with different organizational perspectives are
likely to yield different validity measures of the individual ratee's
performance. More information is obtained, resulting in an even more
accurate and possibly more useful assessment of performance (Duffy,
1976). It is the authors' conviction that ratings should be used most
frequently when the assessment of noncognitive factors is involved, as
in the performance of potential leaders or the performance of fighting
personnel.

In sum, ratings are seen as simple to understand and easy to use.
But ratings permit only relative measurements between person A and
person B. For go/no go measurement, we must consider the third type
of criterion: performance measurements.


Performance Measures 

This third measure of effectiveness is one of the oldest and also,
as one of the newest, has become increasingly acceptable. In
principle a performance test is a job sample test, similar in form to
the trade test of the early years in industrial psychology. The test
of performance in an actual situation has been applied with growing
frequency where the need for more objective measures is perceived as
crucial.

The advantages of the situational performance measure make it a
much more effective criterion measure than the grade or rating, even
though the development of such measures presents challenging problems.
With performance tests, we can approach success/failure limits, a goal
not reachable with traditional ratings. For example, how many hand
grenades can the soldier throw on target in one minute? Or, how long
does it take a squad to capture a specified hill? With such precise
information, a commander can better assess the performance of
individuals or groups; with ratings such comparison is less feasible
because the needed reference point is lacking.

REALTRAIN. A most effective use of performance testing is
exemplified in REALTRAIN, one of the Army's new and extremely
successful tactical training systems (Root et al., 1976). The measurement
objectives of REALTRAIN include a specific set of operations for
observing and evaluating agreed-upon relevant kinds of behavior; the
recorded data indicate whether or not a clearly operationally-defined
job or task has been performed. The soldier's performance is measured
directly; no inference is necessary. Simulated battlefield realism is
an important consideration, so the performance objectives for combat
effectiveness require that:

(1) Leaders and soldiers take timely and appropriate response to
enemy action in a dynamic combat situation.

(2) Units achieve effective and efficient intra- and inter-unit
coordination.

(3) Units maximize the effects of available weapons on the enemy.

(4) Units minimize the effects of enemy weapons on themselves.

The REALTRAIN method provides realism for two-sided, free-play
exercises, with a credible means of assessing casualties. Infantry
REALTRAIN exercises are centered around the M16 rifle. Each soldier's
weapon is equipped with a 6X telescope (Fig. 2), and all participants
wear 3-inch black two-digit numbers on their helmets. Opponents try to
read each other's numbers using the telescope. When a man on one side
identifies a number, he fires a blank round and reports the number to
a controller; the controller then radios the number to a controller
with the opposing force, and the man whose number was identified is
assessed as a casualty (Shriver, Griffin, Jones, Word, Root, & Hayes,
1975). Procedures have been developed to determine casualties
objectively for the M-60 machine gun, hand grenade, M18A1 Claymore
mine, LAW, tank main gun, TOW, DRAGON, and M16A1 antipersonnel and M-21
antitank mines. A critical element of the tactical engagement
simulation occurs during the after-action review, when events surrounding
each day's action are discussed and feedback is provided each individual
involved in the exercise.





Figure 2. REALTRAIN simulation identification.



REALTRAIN is based on two conceptual frameworks. The first, as
outlined by Uhlaner (1970), specifies human performance in systems
terms; the second is based on the premise of the performance situation,
in this case "success in battle." The initial validation of REALTRAIN
(Root et al., 1976) with Army combat units in Europe and validation
research at Fort Ord, California (Banks, Hardy, Scott, Kress, & Word,
1977), have indicated that training effectiveness results are
impressively and consistently positive.



An obvious disadvantage of such performance measures or situational
tests, however, is that they are difficult and expensive to construct.
Despite efforts to facilitate administration of standardized job
elements, the observer's task remains a demanding one. Whenever
possible, ARI relies on automatic recording of responses. One example,
related to REALTRAIN, is the Multiple Integrated Laser Engagement
Simulation System (MILES) (Fig. 3), a family of low-power, eye-safe
lasers which will simulate the direct fire characteristics of the
M16A1 rifle, the M60, M2, and M5 machine guns, the VIPER, DRAGON, TOW,
and Shillelagh missile systems, plus the 105mm tank main guns. A
hierarchy of weapons effects is established in the detector logic; for
example, a tank main gun can destroy an armored personnel carrier, but
an M16 rifle cannot. This equipment provides immediate and accurate
casualty assessment in two-sided, free-play tactical exercises. The
laser "firings" are keyed by the discharge of a blank round. Despite
the sophisticated apparatus, a knowledgeable official is still needed
to ensure that proper procedures are followed. Thus, a need still
exists to train observers thoroughly and rehearse them repeatedly in
what they are to do.
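
The hierarchy of weapons effects described above can be pictured as a simple lookup from target type to the weapons able to defeat it. The following is only a minimal sketch under stated assumptions: the table name CAN_DEFEAT, the function assess_hit, and every table entry other than the tank-gun/APC and M16/APC cases are hypothetical illustrations, not the actual MILES detector logic.

```python
# Hypothetical effects table: target type -> simulated weapons that can defeat it.
# Only the armored-personnel-carrier example (tank gun yes, M16 no) comes from
# the text; all other entries are assumptions made for illustration.
CAN_DEFEAT = {
    "soldier": {"M16A1 rifle", "M60 machine gun", "tank main gun"},
    "armored personnel carrier": {"tank main gun", "TOW"},
}


def assess_hit(weapon: str, target: str) -> bool:
    """Return True if a registered hit by `weapon` counts as a casualty on `target`."""
    # Unknown targets are treated as unaffected by any weapon.
    return weapon in CAN_DEFEAT.get(target, set())
```

A table of this kind keeps the casualty rule explicit and easy to audit, which is the property the text attributes to the automatic recording approach; it does not replace the observer who checks that procedures were followed.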
Figure 3. Multiple Integrated Laser Engagement Simulation (MILES)



Organizational effectiveness. A somewhat different area of
measurement deals with the diagnosis and evaluation of Organizational
Effectiveness (OE), often requiring situational performance measures
of a largely non-cognitive nature, especially measures of attitudes
and values. The Work Environment Questionnaire (WEQ), used in OE
research, provides attitude measures of the supervisors and the work
group, gives situational factors that are related to job performance,
and relates their importance to the job as perceived by the soldier
and his leaders. The WEQ has been validated against objective standards
of job activity and self-perceptions of work, all of which were in turn
validated against actual on-the-job performance (Turney & Cohen, 1976).

The objective of the OE program is to identify and to optimize
those organizational factors in the Army work environment related to
soldier job satisfaction, motivation, and performance. The objective
is met through a five-phase research program, progressively identifying
and developing:

(1) Criteria of organizational effectiveness.

(2) Organizational functioning: structures, processes, and problems.

(3) Parameters of the OE process.

(4) Diagnostic methods.

(5) Intervention strategy.

The WEQ study was a follow-up of extensive longitudinal research
conducted over a 3-year period to develop the diagnostic instruments.
Pretests in 1973 provided initial data, validation of the instruments
was conducted in 1974 and 1975, and in May-June 1975 an original
diagnostic survey was conducted in one Army agency in the Army Air
Defense Command. The survey focused primarily on Morse operations in
a field station. Experimental considerations were:

(1) The work was performed by 16-man teams, each consisting of a
senior NCO supervisor in charge of 14 operators and one analyst.

(2) Both individual and team performance criteria could be
collected for validation purposes while the team did its job.

(3) A large number of teams performing identical functions allowed
experimental control.

The Morse operations are important to the mission requirements of
the organization and are representative of the complex semicomputerized
systems being implemented Army-wide (Cohen & Turney, 1976).



The findings, in general, revealed seven major organizational
problem areas: peer group norms which fail to encourage good
performance, insufficient performance feedback, need for training in
supervisory technique, role ambiguity and conflict, inadequate
intergroup communication patterns, lack of a clear performance-reward
relationship, and ambiguous performance evaluation standards. OE
intervention was able to alleviate most of these.

Duty modules. An example of the development of performance
criteria is the duty module concept, which has the practical advantage
of a composite criterion combining related variables that operationally
define performance measures to the evaluators. The duty module is a
cluster of tasks that are meaningfully related though not necessarily
contained in one job. In fact, an ARI research project found that
eight job dimensions could be incorporated into a single Job Proficiency
Appraisal instrument designed to assess 39 entry-level specialty fields
of the Officer Personnel Management System. These job dimensions
describe specific duties in the areas of Administrative Details,
Correspondence, Counseling, Maintaining Standards, Training, Supply
Management, Technical Knowledge, and Control/Coordination (Duffy,
1976).

NOE. Situational performance tests demand both subject matter
expertise and psychological knowledge. Imagination and ingenuity are
required to bring out the desired performance in a highly concentrated
test behavior simulation, contrived and presented for the examinee
within limited geographical bounds. A host of practical problems must
be solved. One example of a field problem is that used by Army
helicopter performance evaluators.

The helicopter pilot's task is to navigate or fly a UH-1 helicopter
over a prescribed route at Nap-of-Earth (NOE), sub-treetop height, at
variable air speeds, using natural features for concealment. The
performance is conducted in the field, and three measures are used:

(1) Total mission flights - a distance/track deviation measure
which tells the percent of track followed and to what degree the pilot
has been off course.

(2) Individual tasks - tasks abstracted from total performance,
such as mission planning (Farrell, 1973).

(3) Special individual behaviors - a high degree of abstraction is
often involved here and, for that reason, the measurement of such
behaviors is most readily accomplished in the laboratory. For example,
levels of ambient illumination can be varied in order to determine
effects upon terrain recognition ability.



Besides the practical complications in measuring performance in
the complex and multidimensional task of pilots, there is the problem
of weight in the value of an error (e.g., the operational significance
of a course deviation error of 300 meters versus a deviation of 50
meters). This is a typical problem presented by performance measures
that are tied to operational missions.
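
To make the distance/track deviation measure and the weighting problem concrete, the sketch below summarizes a set of lateral deviations sampled along the route. It is a hedged illustration only: the text gives no formula, so the function name track_measures, the equal-interval sampling, and the 100-meter tolerance are all assumptions.

```python
def track_measures(deviations_m, tolerance_m=100.0):
    """Summarize lateral deviations (meters) sampled at equal intervals along a route.

    Returns (percent of samples within tolerance, mean absolute deviation).
    The 100 m default tolerance is a hypothetical value, not one from the study.
    """
    if not deviations_m:
        raise ValueError("need at least one deviation sample")
    on_track = sum(1 for d in deviations_m if abs(d) <= tolerance_m)
    percent_followed = 100.0 * on_track / len(deviations_m)
    mean_abs_dev = sum(abs(d) for d in deviations_m) / len(deviations_m)
    return percent_followed, mean_abs_dev


# One 300 m excursion among otherwise small errors.
print(track_measures([0, 50, 300, 50]))
```

Note that neither summary number says whether a single 300-meter excursion matters more operationally than several 50-meter ones; that weighting is exactly the judgment the text says the measures themselves cannot supply.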

Despite these practical difficulties, a strong belief exists
among performance research scientists in the human factors area that
further progress in more sophisticated differential validation of
certain kinds of human factors performance, particularly the kinds to
which future officers of the Army may be exposed, can best be achieved
by this sort of field/laboratory measurement. Earlier we implied that
ratings hit only a common core of ability. We believe that situational
performance measures will permit a sharper delineation of differential
ability, as already evidenced by the Fort McClellan research project
on officer performance.

Peculiar to the military and to the Army, whatever criteria are
used, is the fact that jobs must be performed under both peacetime
garrison and combat conditions. One of the biggest challenges has been
how to secure effective measurement of performance in the combat
situation. Combat situations are relatively rare, of course, and, when we
find them, it may be extremely inconvenient to secure complete
evaluations. Recognizing the importance for military psychologists of
obtaining measures against such elusive combat criteria, research
scientists have developed an approach called criterion equivalence
(Wherry, Ross, & Wolins, 1954). The fundamental procedure in criterion
equivalence approaches is based on a mathematical truism, that when two
measures are equal to a third, they are equal to each other. Criterion
equivalence studies have led to the conclusion that the same measures
are predictive of performance in both combat and garrison situations.
The specific techniques of accomplishing criterion equivalence are
elaborated in reports by Gaylord (1953) and Johnson (1956).

Systems Criteria

Underlying the discussion thus far have been the concepts of
comparing one person with another, or one person against a specific set
of job standards. As our laboratories have become concerned with
systems and system research, we have become more aware of the fact that
the systems the Army will be required to manage have very complex internal
structures, and that if we are to learn how to act so as to produce the
results intended, we will need new ways of thinking about complex
systems (Uhlaner, 1960, 1964, 1975).

Development of the systems output criterion has proved to be
somewhat more difficult. The generalized concepts that the military manager
or system developer intuitively intended are very difficult to translate
into operational terms. Systems evaluations are primarily a matter of
judgment by experts; and the larger the system, the more complex and
difficult the translation from concept to operation becomes. Because
of side effects and contingencies, many of the tasks do not have the
outcomes intended. One of the greatest challenges for systems
psychologists is to develop meaningful tasks that carry out system
objectives.

From a situation where man has been the focal point, he has now
become a linkage in a system. These systems are also becoming more and
more expensive, not only in dollars but in time lag. For any particular
military function (for example, Command and Control) a number of
competitive man-machine systems are being developed on a concurrent
basis, and they have to be evaluated before they become operational.
The evaluation of these competitive systems must be sound enough to
enable military managers, together with the scientists, to make correct
decisions as to the appropriate system or subsystem to be carried to
completion or made operative.

The research psychologist has been asked to assist in establishing
the appropriate subsets of functions to be performed, that is, the jobs of
the men within the chosen system. He is asked to indicate the kind of people
needed, not only in terms of talents and aptitudes, but also, where
appropriate, even in terms of personality characteristics. The researcher
is asked to establish interrelationships and hierarchies within the
system, to look at equipment and help engineers to design it, in order
to make functions and jobs easier and more manageable by the average
person. Concurrently, he is asked to develop training programs and
devise aids which will, in the time allotted, train each individual to
perform these functions. He is asked to look at the activities
performed by the individuals after their training to see whether he can
improve work methods. In the meantime, in theory, the machines will
have been frozen in their design. In practice, all the processes of
development are recycled many times. It is the last contingency that
makes human factors problems more fluid, more complicated, more of a
challenge.

Within this setting, the military manager who directs an evaluation
of the total system or the subsystem is likely to accept more
wholeheartedly the research product when it is expressed in quantitative
units that can be related to his goals and missions. The total impact
on the operation is the key concern of the military consumer. We believe
that human factors research scientists must think in terms of the total
mission effectiveness of a system, rather than exclusively in terms of
the effective performance of individuals. It is because of the military
consumer's end product orientation that systems research and systems
development are today enjoying enthusiastic support.

On the surface, the systems output criterion resembles the
situational performance criterion, in that both include aspects of the actual
job. But development of the systems output criterion requires
painstaking experimentation in the laboratory, before taking the criterion
into the field, in order to establish quantitative relationships
between actual independent variables and various aspects of human
performance in the system. In the situational performance measure,
subject matter experts are traditionally employed to help assure
accuracy of simulation for realism and adequacy of performance coverage.
In developing the systems output criterion, operating field personnel
are used to help assure adequacy of simulation and coverage, and,
equally important, to assist in establishing critical parameters of
performance for simulation. Measures of system performance usually
involve some clearcut base against which to evaluate performance;
for example, accurate and rapid detection and identification of aircraft
and tanks.

We think the most exciting and interesting aspects of human
performance oriented systems research lie in the near future. There
are possibilities for research in the broader areas of social,
governmental, and environmental regions (to include man-machine systems) in
relation to each other and the system and subsystems output. The basic
framework of human performance systems research reflects a philosophy
of integrated research effort (Uhlaner, 1975). Such a framework is in
keeping with the present-day direction of systems psychology (DeGreene,
1971), with greater emphasis on application of psychological principles.
This framework provides a particular segment of society, in this case
the Army, with usable results for the development of effective human
performance systems.
FOOTNOTES
Extraneous remarks

1. Originally, I had two charts, 1945 to 1955 and 1955 to 1976,
and they show this trend. The nature of the data is pretty
rough. These categories aren't mutually exclusive, so I
simply collapsed them into one table.

2. We are trying to get to our construct validity, and this seems
to be one way that we can do it.

3. The references here range from 1957 to 1977. The external
criteria here in combat situations are combat training like AIT
and ratings by platoon sergeants and commanders in places like
Korea and Vietnam. Leadership and West Point were based
on the same thing: on West Point graduates, how well they
performed in West Point, how they were rated by their peers,
how well they did after they got out into the field (quite a
bit later). The Ranger study is our most recent and has to
do with ranger training, peer ratings during ranger training,
and how well they performed in Vietnam, based on the rating of
their immediate commander, usually. We had one more that had
to do with the peer ratings of selection for General, but we
really haven't put that one together yet. We don't know
whether the colonels are rating other colonels on the basis
of knowledge of their performance and how good a colonel they
are, or whether they know the system well enough to be able
to predict who will be promoted to General. We have a lot of
problems with peer ratings. They are not very well accepted
at this time by people in the Army, and there are a number of
complicated reasons for this.

4. There have been several Court rulings that have aided this
popularity.

5. REALTRAIN is extremely popular with the troops. We're using
it in Europe with great success.

6. TOW is a tube-launched, optically tracked, wire-guided antitank weapon.

7. We only have two regiments rigged up like this. As you can
imagine, it's a little bulky and inconvenient, but it seems
to work quite well.

8. An individual soldier can accomplish the required objective,
but he may not accomplish it in the right way, so you have to
have somebody out there to watch him.



9. Organizational Effectiveness in the Army has been so
successful up to this point that we are developing Organizational
Effectiveness Research teams in the Army and sending them to
various areas.

10. We're trying to avoid a Hawthorne effect.

11. There's an evaluator in the helicopter itself, and then there's
another helicopter that flies about 1,000 feet above with
another evaluator. So it's evaluated by at least two people
in flight.

12. A lot of missions that the UH-1 pilots perform are at twilight
or dawn. One of the problems has to do with the point in
darkness that a pilot can successfully perform NOE missions.
It was thought that experienced helicopter pilots would have
no difficulty with NOE flying. This turned out not to be the
case. Pilots trained in NOE could perform; pilots not so
trained had difficulty.

13. Q: Is there any device for carrying REALTRAIN kinds of data
back as far as the selection level, or is it only a training
evaluation procedure and it stops there?

A: At the moment, it is a training evaluation procedure, but
they are working on carrying it back to at least a
selection level. But at the moment it's strictly a training
evaluation procedure.

Q: How is your skill qualification test coming, and what do
you estimate to be the cost per year of operationalizing
it and managing it?

A: The skill qualification testing is coming along great.
We'll have the SQT's in place in about a year and a half
or two years. I have not even the foggiest idea of what
the cost is.







NAVY EFFORTS IN CRITERION DEVELOPMENT FOR
JOB PERFORMANCE EVALUATION

Frederick A. Muckler
Navy Personnel Research and Development Center

Introduction

One nice thing about discussing the area of criterion development
for job performance evaluation in the Navy is the multitude of available
examples. Indeed, all of our systems applications and our R&D programs
are, without exception, infested by the criterion problem. Thus, my
charge, which is an "Overview of US Navy Efforts in the Criterion
Area," is in one sense a simple one. I can state categorically that
where we have a human behavior measurement program we have a criterion
problem.

Further, in general, we adopt one of three approaches to the
criterion problem. First, we often ignore it and hope that somehow the
solution will appear as a natural result of doing the work. Second, we
often agonize over it. The question most often heard here is: "What
does all this mean?" Third, we may attempt to solve the problem
scientifically; this is the "sound methodology" approach which assumes
that good methods will extract acceptable criteria. None of these
approaches, of course, tends to work very well, even where in many
cases we will alternate between all three.

The basic problem, it seems to me, is that we persist in demanding
meaning from our measurement. We want to be able to know what our job
performance measures add up to; we want to evaluate them. If we only
did not have to do that, if we could only be satisfied with the data
points alone, the criterion problem would disappear. Indeed, some of
us adopt just that technique. We collect the data, publish the report,
and leave the meaning to somebody else. Unfortunately, we have all
found that when others interpret our data the consistent result is
misinterpretation and misuse.

From a host of possible topics of concern to Navy research, I would
like to concentrate today on three areas. First, we are concerned with
methods of generating criterion sets; I shall be concerned with four
tools and the problem of "criteria of criteria." Second, I have selected
six specific technical problem topics within the criterion development
area. And third, I would like to mention seven applications examples
where the criterion problem remains unresolved.



So far as I can see, while the areas reviewed and the examples
cited are Navy-specific, all of them represent problems in criterion
development for any context of human performance evaluation. I do not
see that the Navy has any unique problems in this area. Rather, they
are problems shared by all and, sadly, they are problems which have had
a persistent history in industrial and organizational psychology
(Gilmer, 1971; Landy & Trumbo, 1976; Smith, 1976; Thorndike, 1949).

Generating Criterion Sets

With respect to the first area, that of generating criterion
sets, I will assume that we have available some quantity of raw job
performance data: a lot or a little, subjective or objective, complete
or incomplete. Given those data, the question now is: "How do we
evaluate them?" Or "What does it mean?"

Technically, it seems very important (to me at least) to repeat
again and again one fundamental point: the measures of job performance
and the criteria on those measures are not the same thing. Criterion
"measures" are in fact above and beyond performance "measures."
Performance "measures" are neither good nor bad; criterion measures
make them so.

Smith (1976) has recently commented: "The first requirement of a
criterion is that it be relevant -- to some important goal of the individual,
the organization, or society." If one accepts this requirement, it
seems apparent that criterion sets are transforms on the job performance
measure sets. These transforms must relate to domains far beyond
specific job performance per se.



So, our problem here is the methods by which we generate criterion
sets which in fact will provide judgment, if you will, to some other
context. I would like to distinguish four general methods, all of
which can be seen in current Navy research and development.

(1) "Traditional" sets. I doubt if there is any context in which we
work with job performance measurement where there is not already a
"tradition" of past criterion sets. One of the major emphases of many
current Navy R&D studies is "productivity" (Muckler, 1976). We are
concerned with the lack of it in Navy task performance, and we are much
concerned with methods of enhancing it. The criterion may be simply
stated as: More is better. Whatever the individual does, he or she
should do more of it in the same unit of time.

But in most cases, "more is not better." I am reminded of a
productivity enhancement program in a cigar manufacturing plant where
individual cigar output was increased from 3,000 to 6,000 per
day by using all of our bag of tricks in self-pacing, participative
management, work incentives, and so forth. Unfortunately, the sales
manager returned to the plant and informed management that the plant
aggregate based on 3,000 per day per worker was all the market could
bear. The end result of 6,000 per day was a lot of cigars stored in
the warehouse, so more is not necessarily better.
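
The cigar anecdote illustrates the earlier point that a criterion is a transform applied to a performance measure. The sketch below applies two different transforms to the same daily output figure. It is an illustration only: the function names more_is_better and market_limited are hypothetical, and the 3,000-per-worker demand ceiling is taken from the anecdote rather than from any data.

```python
def more_is_better(output_per_day: int) -> int:
    """Traditional criterion: the score is simply the quantity produced."""
    return output_per_day


def market_limited(output_per_day: int, demand_per_day: int) -> dict:
    """Criterion tied to an outside goal: only what the market absorbs counts."""
    sold = min(output_per_day, demand_per_day)
    # Anything produced beyond demand ends up in the warehouse.
    return {"sold": sold, "stored": output_per_day - sold}


# Same performance measure (6,000 cigars), two different evaluations.
print(more_is_better(6000), market_limited(6000, 3000))
```

Under the first transform the doubled output looks like a clear gain; under the second it adds nothing sold and 3,000 cigars a day to storage. The performance measure is identical in both cases, and only the criterion changes the verdict.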

A second example concerns the productivity of our training systems.
Navy programs are no exception here to the demands now being placed on all
training systems everywhere. We are told that we must have more and better
training for the dollar. With respect to more training, certain traditional
measures suggest themselves immediately: (1) number of students produced,
(2) staff/student ratio, or (3) attrition rate. We must maximize the
first and minimize the second and third. Unfortunately, none of these
seemingly useful traditional measures has clear criterial interpretation.
How many students we produce, for example, must be tempered by how many
students we place in jobs. Further, to state that a training activity
has attrition rates of 0%, or 50%, is meaningless without reference to
other criteria. I assume that should we achieve 0% attrition we would
then be accused of making training too "easy."

The difficulty with traditional measures is that while they may be
incomplete, ambiguous, or even incorrect to us, they are often most
"relevant" to others. In job performance, for example, it is natural
that managers should ask for more productivity; they are most often judged
on the basis of that single, "ultimate" criterion. We must, I think, at
least be sympathetic where "simple" criterion measures are commonly used.

(2) "Theoretical" sets. How delightful it would be if we had
formal quantitative models where the criterial transforms would be clearly
and mathematically specified. We would know what they are and how they
are computed. Considering the sheer amount of past work in job
performance evaluation, covering surely thousands of research publications,
it may seem strange that we do not have more formal theory. In some few
selected cases such theory is available, but even here the issue is not
simple.



It was my pleasure for some years to work in an area where the
relationship between individual job performance and system performance
could be mathematically stated with great precision. This was the area
of optimal control theory. Given the statement of the system state
spaces and the allowable system processes, it is possible to define
mathematically optimal paths. But even here the judgmental process was
essential. It turns out that there is no one optimal path for any
system. It depends on what you want. And what you want depends on
judgments that have nothing to do with the measures or the mathematics.

To my knowledge, there are no R&D programs working on developing
quantitative theoretical models that will relate our job performance
measures to our criterial sets. The closest thing to it has been connected
with the computational problem of dealing with very large numbers of
predictor and criterion variables. The past decade has brought us both
the mathematics and the computer capability to deal simultaneously with
very large N-dimensional measure sets. At the present time, we have a
program based on complex polynomial regression equations using mini-computer
technology specifically designed to deal with job performance measures.

While these techniques will allow us to handle large quantities
and kinds of job performance measures, they are not "theory" in the sense
I am using it here. They will allow us to process coherently
large amounts of job performance data; but they will not tell us what
is good and what is bad.

(3) Empirical methods. To me, one of the most interesting
developments over the past decade has been the development of empirical methods
of deriving both criterion measures and the weights that should be
assigned to those measures. It seems particularly appropriate here that
mention be made of the work of Ray Christal and the JAN procedure (1968)
and synthetic criterion methods (Mullins, 1970). With this technique,
and others like it, the logic seems clear: If criterion sets require
expert judgment, then let us systematically and empirically investigate
the experts.

It would appear that the most popular technique at present with
Navy programs is Delphi, the procedure normally associated with Dalkey
and Helmer (1963) and the Rand Corporation. For some reason, Delphi
has become extremely popular in Navy programs. Recently, I have seen
Delphi used in such situations as decision making, unit performance
measurement, training, tactical field exercises, and the like (Sander,
1975; Larson & Sander, 1975). There is certainly something very
satisfying in a systematic way of collecting expert opinion and using this
to derive criterion sets. The results always seem to me to be very
interesting.

But at the risk of seeming simple-minded or, worse, unempirical,
something always bothers me about these studies. I find myself
constantly asking the question: "Is this really true?" Or, perhaps
better, "What is the probability that even a large group of experts can
come to the wrong conclusions no matter how carefully their judgments
are collected?" Or, another question: "Do 'subject matter experts'
really know what the problem is?" In short, just how much confidence
can I place in the validity and completeness of criterion sets generated
by experts?

A case in point: I suspect that if I were to use Delphi on
industrial managers, the result would be that the most important single
criterion is to maximize profits. Yet studies by Stagner and many
others have shown very clearly that in fact they do not behave that way.
They simply do not behave as managers to maximize profits. What they
say and what they do are not necessarily the same thing. Delphi may
give me what they say, but is it what they do?



(4) Criteria for criteria. Last, I would like to turn to criteria
for our criteria. Those of us trained in traditional psychology, I hope,
surely cannot ever forget validity and reliability as criteria for our
criteria. But the literature of the past few years seems to me to
raise the question of "completeness." Validity and reliability are
surely necessary, but they seem to be not sufficient.

Let me quote again from Smith: "The first requirement of a
criterion is that it be relevant -- to some important goal of the
individual, the organization, or society." Somehow I feel that our
traditional methods of demonstrating validity and reliability will be
insufficient to satisfy that requirement.



Fortunately, the American Management Association Management
Handbook (Moore, 1970) provides a set of criteria about criteria from
the management point of view. There are eight of these, and I would
like to apply them to the problem of job performance evaluation.

(1) Suitability. Are the measures relevant, and do they
support the purpose and mission of the organization?

(2) Feasibility. Are the measures theoretically attainable
within the organization?

(3) Acceptability. Will the management accept the measures
and provide the resources to collect the measures?

(4) Value. Are these measures the best buy for the money?

(5) Achievability. Can, in fact, the measures be collected?

(6) Measurability. Can the measures be quantified in
terms of quality, quantity, time, and cost?

(7) Adaptability and Flexibility. Can we change the measures
to reflect changing organizational environments and management needs?

(8) Commitment. Does everybody in the organization want to

This, then, is one management view about the evaluation of our job
performance measurement. Frankly, considering how difficult it has been
for us just to get marginal validity and reliability for our measures,
these additional eight requirements seem rather overwhelming.

Some Current Technical Problems

Let me now turn to the second topic area. I have selected some six
issues that bother us. The list is by no means exhaustive, but these
are problems, as I look across Navy programs, that I really see looming
very large.

(1) Data acquisition. First, the problem of collecting data. It
seems to me that with respect to job performance evaluation, we are
routinely collecting more and more data points. For several reasons,
it seems a great deal easier to collect more and more data. Indeed, it
seems to be expected.

In a current study we are collecting data on over 50 measurement
dimensions for the job performance evaluation of sonar technicians.
Included are cognitive, vigilance, noncognitive, biographical, perceptual,
biochemical, standard test, and peer rating measures. The principle
seems to be: If it moves, measure it.

(2) Data processing. We feel free to measure more and more things
because we now have available (theoretically) enormous data processing
capability. To be sure, thanks to the computer, we can now do data
processing tasks that simply could not have been done manually a decade ago.

This is certainly true for our studies in job performance evaluation.
We can use standardized scenarios to measure job performance through
computer training modes. And, as another study has shown, some minority
group members [illegible] do in the traditional evaluation
situation.

(3) Cost-effective criteria. But all of this is not at small
cost. It seems reasonable (indeed, essential) that we ask if all these
additional data points and these computers are cost-effective. I do
not know. I do know the data acquisition and processing techniques we
have been exploring are far more expensive than "traditional" job
performance evaluation methods.

In some cases, we are introducing job performance evaluation where
there has been none before. The cost comparison is particularly
unfortunate: zero versus N thousands of dollars. The expression of
effectiveness for these costs is not certain. In one specific case,
we were able to disclose certain critical skill deficiencies and
institute remedial training to eliminate those deficiencies. Was it
worth it? That is difficult to say.

(4) On-the-job validation. On-the-job validation of job
performance evaluation has always been difficult. On the one hand, we
appear to be getting much better access to the operational environment.
We are doing better aboard ship, and where that is not possible, we are
bringing very sophisticated measurement vans dockside to the ships.

On the other hand, there remains a large core of job performance
measures that we cannot validate without World War III. One increasing
trend here is the use of full scale simulation of the mission as the
validation device. While I see no alternative at the present time, one
is left with the doubt that performance in the simulator may or may not
predict performance in combat.



(5) Simple versus multiple criteria. Next, no one likes simple
measures more than I do. Yet I do not see how we can ever expect to get
simple criteria for a process as complex as human job performance.
Looking only at the task itself and the performance associated with it,
I have yet to see a "simple" task or "simple" performance. I sincerely
hope I am wrong.

I cannot pass this subject by without commenting on the Holy Grail
of job performance evaluation: The Ultimate Criterion. In the
literature, and certainly in practice, we continue to hope for that single,
final criterion that will express everything -- whatever that may be
(Thorndike, 1949). But it seems to me that researchers at least have
abandoned that search. Every current study of which I am aware assumes
the need for multiple criteria.

(6) Measurement versus evaluation. I am still concerned, however,
with what appears to be a continuing confusion between job performance
measurement and the evaluation of that measurement. We appear to be in
a minor phase of, as just noted, radical expansions in the quantities of
data we collect. I would predict that this phase will begin to change
and that we will, in the future, be collecting less data. We are, I
hope, going to become more discriminating in getting that data relevant
to interpretation and use.

Some Criterion Application Areas



Let me now turn to my last area, which is some of the specific
application areas in which Navy research and development is under way.
In each of these cases, it appears to me increasingly that the question
is being asked: "What do you want to know?" before we decide what job
performance measure sets we should collect. Depending upon the use
that will be made of the data, it seems clear to me that differential
job performance measure sets may be selected. Or, to put it another
way, in each of these cases job performance evaluation is essential,
but the measure sets may differ depending upon the application.
Incidentally, I have yet to be able to convince many of my colleagues
that this might be true. So let me offer it to you as a possible
hypothesis.

(1) Individual job performance evaluation. I have made several
mentions of individual job performance evaluation. Let me summarize
as follows: We are taking much more complete measure sets, we are doing
much better in job performance evaluation in operational environments,
but we have yet to demonstrate convincingly (at least to me) that we
are cost-effective.




(2) Unit (team) performance evaluation. Increasingly, our
efforts are turning (or perhaps returning) to the importance of unit
(team) performance measurement. A very positive sign to me is the
renewed attempt to measure both process and outcome of team performance
measurement. For some time it seemed to me that we avoided outcome
measurement because it was so difficult. For example, studies of
communication systems stressed all sorts of internal process measures
such as frequency of interaction and so forth, but I never knew what
happened to the messages. In this case, the Delphi technique appears
to be useful in deriving unit performance effectiveness measures
(Larson & Sander, 1975).

(3) Personnel subsystem readiness. Many of our users are not
satisfied with evaluations of individual job performance. We have been
getting increasing demands for some expression of the state of the
entire personnel subsystem (Borman & Dunnette, 1974). We are asked, for
example, "What is the personnel readiness of this ship?" In short, what
is the aggregate of all the people on the ship? I would not pretend
that we have an answer to that question, but we are trying to see what
we can do with the question. I, myself, am not yet convinced
intellectually that it is a meaningful question, but emotionally and
intuitively, I find it very attractive.



(4) Personnel system operational readiness. To move up one level
of complexity, we are increasingly being asked to contribute to some
representation of total system operational readiness. In terms of
operational readiness, for example, what does it mean when the ship is
95% manned? Or, what does it mean if the personnel in a given rate are
only 75% job proficient? I would not pretend that we know how to answer
these questions precisely, but we are being asked once again. At the
present time, the method primarily in use is through total system
simulation models of performance. I hasten to add this is modeling
simulation and not physical simulation.

(5) Selection, training, and organizational development. In the
areas of selection, training, and organizational development, I find a
number of what are to me encouraging trends. For one, the performance
measurement seems to me to be getting far more precise and hence of much
greater diagnostic value (Campbell et al., 1974). This is particularly
true in training. Job-referenced performance measurement seems to me
to be looking much closer at the microstructure of job deficiencies.
This is not for the sake of measurement, but rather so that remedial
training can be closely tailored to the individual's training needs.
In organizational development, it seems to me that performance
measurement is becoming far less global and vague and far more sensitive
to the actual events that occur -- complex though they may be.

(6) Productivity and accountability. I have made previous mention
of the problem of productivity. In this case we are being asked to
supply job performance measurement that will serve as the basis for
productivity enhancement and individual, team, and organizational
accountability. I, for one, am glad that we are being asked. We
remember, I hope, how job performance measures have been misused in the
past for these purposes. If we only stop people from repeating past
mistakes, our services will be of value.



(7) Evaluation of R&D personnel. To end on a threatening note,
we currently have underway studies on job performance evaluation of
R&D personnel. In a program called SHORTSTAMPS (or Shore Requirements,
Standards, and Manpower Planning System), the Navy is attempting to
perform job performance evaluations on all Navy shore personnel with
the objective of better staffing standards and use of manpower. Since
R&D personnel are a part of the Navy's shore manpower requirements, it
seemed reasonable to management that R&D personnel should be included.
I assure you that we argued vigorously against this assumption, but to
no avail. Since we lost, we have decided to help them.

I am reminded of a statement once made to me by a manager: "Somebody
is going to have to guess, and your guess is better than ours." I think
he was right.

[...] side. In looking back over the past 4 or 5 years
[...] I find, to my dismay, that at least for the first [...]
[...] has been no program, principally or primarily,
[...] the criterion problem, per se. As a matter of fact,
[...] would classify in that area sort of began
[...] particular example was Dr. Campbell's
[...] of "Organizational Effectiveness," which

[...] is not that we have not proposed

[...] we have not been able to sell

[...] Fulgham's distinction this morning.
[...] programs we have proposed have been
[...] be useful but not usable."
[...] problem for us to convince our own people
[...] to work in criterion development.

[...] is not a program we have which
[...] a criterion problem. I'm also
[...] and talk to all our researchers and
[...] our research -- all of them, of course,
[...] psychologists -- at how many of our
researchers do not recognize the criterion problem exists. And I
think if you think back, if you were very careful to avoid a
course in industrial psychology or courses in psychometrics,
you could pass through the PhD program without ever having come
in contact with the criterion problem. And so for those of us who
live and die by this problem and who are fascinated by and
concerned by it, it is a little alarming, I think, to see a researcher
in fact embedded in an enormous criterion problem without any
awareness whatsoever that that problem exists. If I look across
our programs and see what our people do with the criterion problem,
I find one of three approaches being used and sometimes all three.

They tend to work even less if you try them after the program has
started.

It is my unfortunate tendency in discussing research, particularly
with our research workers, to ask many questions about their
research. One question that I continue to ask along the line is,
"Why are you measuring that?" And I've discovered that I'd better
ask that question very carefully because frequently I get a
response which implies, "What the h--- are you talking about?"
Or, I frequently get a hostile response, "What's wrong with that?"
And, of course, the answer usually is, "Lots." But I generally
stop asking at that point.

I think this is more than just a semantic point. It seems to me
that an awful lot of the confusion in existing literature and even
among ourselves would be not perhaps resolved but would be clarified
if we were very careful to distinguish two levels of description.
Unfortunately, we've sort of settled into this multiple regression
approach and we call these predictor variables. That's all right --
of course most of them aren't -- but that's all right if we call
them that. But we have gotten into the habit of calling these
criterion variables, and maybe someone gave some of those definitions
this morning -- that's okay, there's nothing wrong with that -- but it
[...] of cases if we would separate that into two levels of description.
And what are the output measures, or what is it, what's happening?
I wish we would go back to the normal use of the word [illegible].
I wish we would realize that, in fact, when we're talking about
criterion measurement, as we will, we are talking about the standards
[...] explicit about it what's acceptable and what is not. It seems to
me if we were very clearly distinguishing between these two levels
of description, a lot of the confusion would clear up. If I might
take for an example "errors." It was my misfortune -- no, I
shouldn't say that -- I happened to be present by accident at the
start of the zero defects program. It was really, truly accidental.
And what started out as a very nice idea -- the goal of zero defects --
somehow got transformed into the requirement for zero errors.
And because we are vague and not too explicit about this, people
began to say, "Gee, we've got to have zero errors." I don't know
of any human activity where you're ever going to have zero errors,
and what was a reasonable goal is an unreasonable requirement.
But it seems to me that frequently when we take error measures we
automatically assume that is good, and I would argue to you
that that is not necessarily so. And when we looked at the errors
that existed, then the first question was, "How do you reduce the
errors?" And, obviously there are many ways of doing this, but
associated with that is some cost function. And, in many cases,
we've found that there was no question that one could reduce the
errors, but as we began a minimization function on the errors,
that the cost of so doing increased very erratically. So we began
to get that sort of thing. I would argue to you that the error
is the output measure; the criterion measure is really this cost
function. And then the question becomes much different when one
is looking at it this way; much different about this sort of
desire of having zero errors. In fact, what one then does is
make a judgment and say, "I'll accept that level of errors as
being acceptable within my system," (right off the bat that makes
you have to define it -- define what level of error is tolerable) "and
for that I am willing to pay that much." I don't want to belabor
this -- I will, of course -- but I really think it would help an awful
lot if we did make this distinction. I really think it would help
a great deal. And particularly now where our measure sets are
being imposed upon by many other than our traditional criteria
(some of which I will get to).
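
The distinction drawn here, between the error count as an output
measure and the cost judgment applied to it as the criterion, can be
sketched in a few lines of Python. This is only an illustration under
stated assumptions: the function names, the shape of the cost curve,
the candidate error rates, and the budget are all hypothetical and are
not taken from the zero defects program or from any Navy study. In
particular, the smooth cost curve is a simplification of the erratic
cost increase described in the remarks.

```python
# Illustrative sketch only: every name and number below is hypothetical.

def cost_to_reach(error_rate, base_cost=1000.0):
    """Assumed cost function: pushing errors toward zero gets steeply costlier."""
    if error_rate <= 0:
        # Zero errors is treated as unattainable at any finite cost.
        return float("inf")
    return base_cost / error_rate


def acceptable_error_level(candidate_rates, budget):
    """Criterion judgment: the lowest error rate whose cost fits the budget."""
    affordable = [r for r in candidate_rates if cost_to_reach(r) <= budget]
    return min(affordable) if affordable else None


# Output measure: error rates the system could be driven to.
candidate_rates = [0.20, 0.10, 0.05, 0.01, 0.0]

# Criterion: what level of error is tolerable for what we are willing to pay.
chosen = acceptable_error_level(candidate_rates, budget=25_000.0)
print("tolerable error rate:", chosen)
```

The error rates by themselves say nothing about what is acceptable;
the answer appears only once a cost function and a willingness to pay
are stated, which is the second level of description argued for above.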

Patricia Smith, in her article (which Major Sellman mentioned) --
May I call this a mini-stop now for a promotional plug on the
Dunnette handbook, which I think is one of the finest things that's
ever appeared for our field. I wish it had been a little lighter
and, of course, a little cheaper, but that's the way it goes,
isn't it? That's the cost function on it.

Relevance to the individual, to the organization, and to the
society. I don't see where any of that is contained in, say, an
error measurement. Indeed, it is a separate transform on those
error measurements. So our problem here, which some of us, at
least in the Navy, are much concerned about, is how do we develop
all these measure sets. How do we develop the output measures,
but more than that, how do we develop the criterion transforms on
those measures. And the answer to that is, "Very badly." There
are four ways that I see that we do this sort of thing. The first,
trying to be as kind as I possibly can, is the traditional way.

This reflects the Navy's almost frantic interest in productivity.
Everybody is concerned about the productivity problem, but I
think we have gone beyond concern into hysteria -- with good cause,
I might comment. We have some rather large organizations in the
Navy that are setting new records for non-productivity. As a
matter of fact, we wouldn't mind that very much if they stopped
making trouble too. Sort of the optimal combination. I have a
great deal of trouble explaining to people that they might
consider the possibility that more is not better. It does not necessarily
imply that because we have more output that this is better. It
seems again that there's a confusion between the output description
and the criterion measurement judgment.

We've been having a very interesting problem in some of the
individualized self-paced training programs that we have done.
They have been extraordinarily effective. They have, in fact,
produced very high quality students in the sense of the very
excellent measures of their proficiency, but they have wreaked
havoc with our logistic system. One student comes out in 3 weeks,
and the next student comes out in more or fewer weeks. The
manpower allocation system has just been thoroughly and totally
confused. Another thing that Le[illegible] mentioned this morning: the
goal there is 100% proficiency, and by God we get them there and
then we no longer have any variance on them. In one particular
case in which I'd better leave out names since it involves
Admiral Rickover, there is a concern about the fact that we give
them [illegible] of students where they're trained to 100% proficiency
and they say, "Well, how can we discriminate between them?" And
we say, "You don't have to." And then, "No, I don't believe that."
So we've got a measure where we get everybody 100% proficient
and, in fact, it's not acceptable to the operational people. We
are under a great deal of pressure to reduce attrition rates.
There again, the question is, "What's an acceptable attrition rate
for anything?" If you don't really carefully distinguish between
these two things you sort of automatically assume zero attrition
is what you want. I would argue not so. Zero attrition, 25%
attrition, 50% attrition, those numbers in themselves have no
evaluation -- they're neither good nor bad. It really depends
on what your system wants to achieve.

It would be awfully nice, I think, if we had the kind of formal
quantitative mathematical theory which would, in fact, define and
set both our measures and the transforms on them. In most cases
we do not have this. And in those cases where I have worked where
we do have this, even that hasn't solved the problem.

So you started off this whole modeling business by saying, "What
is it in your subjective judgment that you want to have?" Once
having made that clear, then we can crank the whole model out
and we can tell you how to go the best path based on that objective.
I don't think that in my lifetime I'm going to see that kind of
theoretical development in our area and, in lieu of that, I suppose
we ought to just muddle through -- and I'm sure we will. I think
it might be worthwhile to comment here just a little bit, if I
might. At a point in my career I had to work a great deal with
mathematicians working in modern optical theory, and the
mathematics are just super. You can spend a whole week looking at an
equation. It's the best of all possible partial differential
equation work and if you get your jollies that way, that's where
you get them. I discovered to my surprise that many of those models
don't predict anything. No, I take that back. They predict a lot
of things which aren't true. In my experience in several areas of
physical theory -- you know, that hard stuff we always talk about --
a lot of their models are not correct. They simply are not valid,
and it doesn't seem to bother them. In acoustical theory, I
commonly saw the pattern where everybody set up the equations,
there was a big computer study, predictions were made, and then
they set up a simulation that fixed it the way everybody wanted it
to be anyway. It's interesting that psychologists, it seems to
me, have been extraordinarily concerned about what we're doing,
and the quality of what we're doing, and the meaning of what we're
doing, and I think that's very, very good. On the other hand, it
seems to me that very frequently we get upset because our problems
are so [illegible] that it seems to us it's all unsolvable.
As far as I'm concerned, having worked in many other theoretical
areas, I think psychology's in pretty good shape. I wish we
wouldn't cry so much about it, however.

11. In the Delphi application to tactical field exercises, the set of
measurable criterion dimensions was, I thought, really quite
sophisticated.

12. I shouldn't tell this story because it's not a very nice one. You
recall that these techniques have one basic technique that was used.
And that technique was that we want to collect these data from the
experts independently and anonymously because we know what happens
when you put them all together in one room. A very recent study
was done which I did not know about until after it was done in
the Navy. They didn't have time to do that and they had them all
together so they sat down and they did it in one room, and there
was, in fact, a hierarchical rank system operating. I'm also
reminded of a study I did some years ago in flight test of an
instrument. We had 12 flight test pilots -- from a service I will
leave unnamed -- evaluate that instrument. They sat down as a
committee to evaluate the instrument and they said, "How many
are in favor of this instrument?" The first vote was 11 to 1.
The one vote was, unfortunately, the commanding officer, and he
said, "We will now have a second vote." The second vote was 0
to 12. I'm astonished; I thought everybody knew about that sort
of problem.

13. Obviously, we're very much concerned with this problem for any
measures that we take. Beyond that we talk about other things
like contamination and deficiency. I prefer to think of deficiency
in terms of the completeness of the measure sets. How complete is
your measure set to describe the phenomena that you're dealing
with -- but that's another problem. I'd like to talk a little more
about this because based on Smith's definition where the criteria
must be relevant to the individual, or the organization, or the
society, we might ask some questions about what kind of criteria
could you get that would define that relevance. How can we say,
for example, "How would the organization view our criterion
measurement?" "What sort of criteria would they put on our
criterion?" Needless to say, that literature is not a very large
one, and it's sort of like this is a good thing to do, but
nobody's been explicit about what these criteria might be.

14. Are you going to give me measures that I can do something with?
And it was interesting in this particular management handbook that
the concern was with both measures of now and also measures of the
time history. I thought that was extremely interesting and
extremely sophisticated. If we recall some of our own literature
here (Dr. Camm has contributed about the dynamic nature of
criteria), it seems to me they did not acknowledge you but it
seemed to me like it was awfully nice there was concern about and
an understanding of the fact that criteria are not eternally
stable.

15. Of course the answer is no, no matter what organization you have.
We're engaged in our annual orgy of performance appraisal at
NPRDC, and I suspect if you were to ask about the commitment
problem, that we would cease instantaneously to do so. This is
not true everywhere. Nobody in particular likes this sort of
thing but they do it anyway. These then are how management of
the organization might respond, by their criteria to our criteria,
a possible set of criteria on criteria.

16. Our users are, frankly, much more sophisticated about this. I
think, with many of our users, if we came in and collected one
number, one output measure, they would be disappointed. They
really expect us to collect large data sets.

17. This is good news and bad news. It's good news because we're
collecting a lot of data, and we're collecting it of a magnitude
so that we can really do something with it. But of course it's
bad news because what it really reflects is we don't know what
we're doing. And we're going to make overkill and make sure
that we don't miss anything. And so we will have a lot of pseudo
predictor variables.

18. In going aboard ship, which is a game we play, we are finding
aboard those ships computers. Now they're there for other reasons.
And we are finding that they're not being used all the time. And
we say, "Hey, can we use those computers?" And the answer is yes.
So now when we come aboard we bring a terminal and software and
we time share with the onboard computers. And we use these in
evaluating for many, many purposes. One is, frankly, personnel
management. I think you would not be surprised, aboard a carrier
with 2,700 people or 3,000 people as the case may be -- one, by the
way, is never sure how many are aboard -- I think you would not
be surprised to know that very frequently there is less than
optimal allocation of personnel resources. Translated, I remember
one propulsion evaluation board on one of our carriers -- the PEB
set up certain standard problems and they expect people to solve
them. In this one case, not only could they not solve them but
they couldn't find anybody who could. Not because he was not
there -- the guy was there -- they just couldn't find him. Then we're
talking about 600 men in the Engineering Division, and just nobody
knows where they are. So this is a real problem. By the way, I
might comment: you don't experiment or test with the devices you
take aboard. You plug in with the onboard computers and we find
that, really, these are extraordinary opportunities with respect
to job performance measurement. So, for example, we can set up
little standard job scenarios, have the folks come down in their
off-duty cycle, and we can measure rather directly their job
performance with respect to standard job scenarios. And this is
working just beautifully, providing the commanding officer likes
it.

19. This was dockside job performance evaluations in three skilled
categories: sonar technician (of course), weather technician,
and missile technician. Now these are supposed to be the best
guys we've got. They're out there doing their jobs; they've been
[illegible] schools and they've years of experience [illegible]
supposed to be super. Jerry and his folks went down and tested
these people on some very sophisticated job reference tests, and
the first thing we found was some rather startling deficiencies in
what the very best of our people could do. You know, you really
don't want a nuclear warhead technician at 70% effectiveness, I
think. I'm happy to say immediately that they brought with them
remedial training programs tailored specifically to the individual
so that the measurement that they got was diagnostic and could, in
fact, be used immediately by the people. I'm happy to report from
the latest data that this was extraordinarily successful. But it
was extraordinarily expensive as well. And so one gets to the
point of saying, "You've got a nuclear warhead technician. What
is the effectiveness of changing his job proficiency from 70% to
98%?" Well, emotionally, it makes me feel much better. But, is
this the kind of data that we can present for cost effectiveness
evaluation? I doubt it very much. Well, let me put it this way:
It hasn't worked so far.

20. I don't see that this is a problem in practice. It seems to me
that in most situations that I'm familiar with, I don't see many
people looking for simple criteria. In practice they are really
looking for multiple criteria because that's the nature of what
you're dealing with. In dealing with mathematicians -- it was
always an interesting experience for me to take this kind of
problem to a mathematician. For two years I was with some of
the world-class mathematicians who assured me that no matter how
complex the problem was they would find it mathematically tractable.
This was, of course, before they saw our problems. And so we
started giving seminars to the mathematicians. We started saying,
"Okay, here's some of our problems, now what do we do with this
mathematically?" I recall one, Ruth Holliman, who's famous for the
Holliman filter, who said, "That's too complex." We used to have
a little scenario in a special, beautiful mathematical library.
You recall Einstein's theory of relativity rested on Riemannian
surfaces, which is the theory which had been developed about
[illegible] years before. So he had a model that he needed. We had
this little thing that we're going to walk through the mathematical
library and a volume would fall on the floor open to Chapter 15,
which was the model for our data. This was our theory of divine
intervention, and it never happened.

With respect to individual job performance evaluation, from a
summarized sum of the comments, I see much more sophisticated
measurement than I've seen. I've seen much more in-depth, on-the-job
performance measurement, and frankly something there I like. I see
a lot more of "objective" measurement; Mr. Camm noted some of
these. We're less and less dependent upon rating methods. I'm
really not against rating methods but I sort of like the fact
we have much more measurement opportunity in individual job
performance situations.

THE CRITERION PROBLEM
OVERVIEW OF EVALUATION AND MEASUREMENT RESEARCH
IN THE AFHRL TECHNICAL TRAINING DIVISION

Philip J. DeLeo and Brian K. Waters

Technical Training Division
Lowry AFB, Colorado



The Nature of the Criterion Problem in Technical Training 

People engaged in training research frequently view the well-known
criterion problem from a somewhat different perspective than those who
perform selection or classification studies. The typical selection
study begins with a careful search for criteria which possess, among
other desirable properties, (a) relevance to the ultimate criterion,
(b) freedom from contamination, and (c) reliability (Thorndike, 1949).
Selection and classification researchers then devise methods of
measuring behaviors (i.e., ability tests) to predict the criterion
chosen. In contrast, training researchers are likely to accept the
criterion objectives of a training course, or unit of instruction, as
"givens" and bypass that aspect of the criterion problem completely,
choosing instead to concentrate on what is essentially a measurement
problem, namely making the mastery or non-mastery decision on
specified criterion objectives. Thus, in both the knowledge and
performance domains, the criterion problem becomes a question of
whether or not mastery of the criterion is the state of nature for a
certain individual. Relying on the instructional system development
(ISD) process to specify appropriate criterion objectives, training
researchers have tended to concentrate their energies on developing
methods for measuring whether these criterion objectives have indeed
been attained. This strong emphasis on measurement will be seen
clearly when we discuss our past efforts, and it continues
prominently in our present and planned work.



Having contrasted selection and training approaches to the criterion
problem, let us now attempt to show how they are related. Figure 1
illustrates the linkages between selection, training, and job
performance in terms of immediate, intermediate, and ultimate criteria.




Figure 1. A model of the relationship between selection and the
ultimate criterion.

Most, if not all, Armed Forces selection and classification tests
are validated using performance in training as the criterion, for the
obvious reasons that training data are easier to obtain, less costly,
relatively reliable, etc. But it is clear that only to the extent that
training performance is truly reflective of job performance are
selection studies on safe ground. For the process described in
Figure 1 to be valid, it is incumbent on training researchers,
therefore, to re-examine a more classical statement of the criterion
problem and consider to what extent training performance actually
predicts job performance. While accurate measurement of training
performance is seen as a necessary condition for total system
effectiveness, it is by itself not sufficient. Realizing this, we have
increased our emphasis on improving training evaluation (Step 5 of
the ISD process), and we will in the future conduct research to
improve the methods by which both training requirements and training
objectives are developed in Air Force training (Steps 2 and 3,
respectively, of the ISD process).

To recapitulate, thus far we have asserted that "solving" the
criterion problem involves answering essentially two questions:
(a) what behaviors should be observed (measured, tested) and (b) how
are these behaviors to be measured effectively (i.e., taking into
account reliability of the measuring devices, efficiency, and accuracy)?
The decision to observe certain behaviors rather than others involves
a content validity approach which is based on defining the job domain
in terms of tasks performed. This aspect will be subsequently referred
to as the definition aspect of the criterion problem. The question of
measurement effectiveness equates to a predictive, or concurrent,
validity approach which relates training performance to job performance.



Table 1 provides a complete overview of our measurement/evaluation
research work as related to these two aspects of the criterion
problem. We shall next review these studies in some detail, indicating
general trends in our program.

Table 1. The Criterion Problem

          Measurement Aspect                 Definition Aspect

Past      o Student Attitudes                o Survey of ATC measurement/
          o Confidence Testing                 evaluation procedures
          o Advanced Measurement             o Task clustering in
            Techniques                         field evaluation
          o Adaptive Testing

Present   o Adaptive Testing Model           o Advanced Field Evaluation
            Development                        System
          o Symbolic Performance Testing
          o Criterion Checklist Reliability

Future    o Latent Trait Applications        o Requirements Validation
          o Adaptive Testing Implementation  o Workshop for Implementation
          o Criterion Referenced Testing       of Advanced Field Evaluation
            (Mastery/Non-Mastery)              System



Previous Work



Since the Technical Training Division of AFHRL was originated in
1969, the primary thrust of our measurement and evaluation research
program has been directed toward the measurement aspect of the criterion
problem. Resources committed to this task have been quite limited,
due primarily to other commitments within the Division such as
development of the Advanced Instructional System. Rarely has more than
one man-year been devoted to measurement/evaluation. Within these
constraints, we have tried to be responsive to the immediate needs of
the Air Force as well as to investigate new techniques for
incorporation into computer based instructional systems.



During the 1969-1972 time period, problems of measuring student
attitude and student achievement occupied our attention. The attitude
measurement project attempted to develop a Student Critique Form
for potential ATC usage. A series of reports (7, 8, 9) was issued
covering the development of the critique scales, the formation of norm
groups, scale reliability, factor analysis of the questionnaire, and
use of the discriminant function to support item validity. The
norm-referenced approach described by our researchers in the final
report (12) was judged by ATC personnel to be operationally infeasible;
consequently, the newly developed critique form was never used.

In the achievement domain, we investigated the utility of confidence
testing in an Air Force environment (2, 3, 4, 5). Confidence testing
is a technique for test scoring, where students are asked to express
the degree of confidence they have in their answer. Confidence testing
could increase the predictive validity of test scores in two ways:
(a) by making constructive use of partial knowledge in determining an
examinee's true score, and (b) by reducing test anxiety. Of the
available techniques for allocating confidence, two methods were
studied in the classroom (6). Neither proved superior, and the students
were relatively indifferent to use of either technique. The most
serious objection came from instructors who felt that the system was
too complex to score by hand. However, the results of this study may
one day be applied through incorporation into a computer scoring routine.
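
As a rough illustration of what such a computer scoring routine might
look like, the sketch below assumes one common confidence-testing
scheme: the examinee spreads probabilities over the answer options and
is scored with a logarithmic rule. The scheme, the function name, and
the floor value are assumptions made for illustration; they are not
necessarily either of the two methods studied in the classroom (6).

```python
import math


def log_confidence_score(probabilities, correct_index, floor=0.01):
    """Score one item from the confidence assigned to each option.

    probabilities: confidence given to each answer option (must sum to 1).
    correct_index: position of the keyed answer.
    floor: hypothetical lower bound, so that zero confidence in the key
           yields a large but finite penalty.
    Returns 1.0 for full confidence in the key, 0.0 for an even spread
    over all options, and a negative value for worse-than-chance
    confidence in the key.
    """
    if abs(sum(probabilities) - 1.0) > 1e-6:
        raise ValueError("confidences must sum to 1")
    n_options = len(probabilities)
    p_key = max(probabilities[correct_index], floor)
    # Rescale log(p) so that p = 1/n maps to 0 and p = 1 maps to 1.
    return 1.0 + math.log(p_key) / math.log(n_options)


# An examinee who has ruled out two of four options and splits
# confidence between the remaining two earns partial credit.
partial_credit = log_confidence_score([0.5, 0.5, 0.0, 0.0], 0)
```

Under this assumed rule, partial knowledge earns partial credit, which
is the first of the two mechanisms mentioned above; hand scoring it
would indeed be tedious, while a routine of this size is trivial to run.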

By 1972, we had turned our attention to finding alternatives to
the multiple choice format for testing the knowledge domain and to the
development of more sensitive scoring systems (1, 14). This effort
culminated in a study by Siegel et al. (15) in which several advanced
measurement techniques were tried in a classroom setting. Included
were novel item formats such as analogies, pictorial testing, and
cognition of figural systems as well as new scoring methods such as
confidence testing, sequential testing, and theory of signal detection.
Though these techniques were, on the whole, successfully demonstrated
in the study, they were not adopted on a wide scale, probably because
ATC first-line evaluation personnel were not trained in their use.

In search of more efficient ways of measuring an examinee's
knowledge and skills, we initiated work in adaptive or "tailored"
testing. Here, a reduced set of items is given to an examinee,
dependent on his or her previous pattern of responses. Our initial
efforts in adaptive testing were to consider the issues involved in
implementing this technique in a computer based training system (10).
Waters (17) also conducted an empirical investigation of one approach,
the stradaptive model for measuring ability, and concluded that the
model held promise.

Hansen et al. (11) successfully implemented two adaptive testing
algorithms, Flexilevel and Hierarchical, in the Precision Measuring
Equipment Specialist course at Lowry. Results from this study are
decidedly encouraging. Time savings approximated 20%, and accuracy of
measurement was nearly identical to conventional procedures.
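
To make the idea of item selection "dependent on the previous pattern
of responses" concrete, the following is a minimal sketch of a simple
up-and-down branching rule in the general spirit of flexilevel
testing. It is an illustrative assumption, not the algorithms
implemented by Hansen et al.; the function name and the simulated
examinee are hypothetical.

```python
def up_and_down_test(items_by_difficulty, answer_fn):
    """Administer about half of an item pool with an up-and-down rule.

    items_by_difficulty: item identifiers sorted from easiest to hardest.
    answer_fn: callable returning True when the examinee answers the
               given item correctly (hypothetical stand-in).
    Returns the (item, correct) pairs in order of administration.
    """
    n = len(items_by_difficulty)
    n_to_give = (n + 1) // 2
    current = n // 2                 # start with the middle item
    next_harder = current + 1        # next unused harder item
    next_easier = current - 1        # next unused easier item
    record = []
    for step in range(n_to_give):
        item = items_by_difficulty[current]
        correct = bool(answer_fn(item))
        record.append((item, correct))
        if step == n_to_give - 1:
            break
        if correct:
            # Correct answer: branch to the next harder unused item.
            current, next_harder = next_harder, next_harder + 1
        else:
            # Wrong answer: branch to the next easier unused item.
            current, next_easier = next_easier, next_easier - 1
    return record


# Nine items labeled by difficulty rank; this simulated examinee
# answers every item of rank 6 or below correctly.
administered = up_and_down_test(list(range(1, 10)), lambda item: item <= 6)
```

In this sketch only five of the nine items are given, and they cluster
around the point where the simulated examinee begins to fail, which is
the source of the testing-time savings that adaptive methods aim for.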

A 1974 study (16), which surveyed ATC measurement/evaluation
procedures in the context of the ISD model, developed some information
which laid the groundwork for our current interest in the definition
aspect of the criterion problem. An in-house follow-on study (13)
appraised the ATC graduate evaluation system, presented a method for
determining over- and under-training, and suggested a task clustering
approach to linking job performance with training objectives.



Present Work 



Work on adaptive testing has been undertaken primarily to decrease
test time. In a well described instructional sequence, frequent
measurement yields assurance that the student has attained prerequisite
basic concepts and skills before proceeding to more complex areas in
the curriculum. However, no single model or algorithm for adaptive
testing has a clear lead at this time, nor are any ready for widespread
implementation. More work needs to be done, particularly in the
theoretical development of adaptive criterion referenced performance
tests. Consequently, we are participating in an interservice project
which is supporting work in this area by Dr. David Weiss at the
University of Minnesota. Another basic research contract with the
same general objective, although with a somewhat different approach,
is also being supported.

Development of an Advanced Field Evaluation System for ATC
represents our first real attempt to validate the link between
training performance and job performance and addresses the criterion
definition aspect. While the primary purpose of the research is to
provide more useful information about training adequacy, a by-product
of this study will be a direct check, independent of the occupational
survey report, on whether tasks trained are actually performed on the
job. Hopefully, as well, there will emerge a more sensitive scale or
measure of job performance.

Another procedure that we have investigated for increasing testing
efficiency is called symbolic performance testing. The underlying
concept in this technique is to capture the essential features of a
performance test in either a paper-and-pencil mode or by means of
audiovisual or computer graphic presentation. Thus, we administer a
symbolic version, or analog, which correlates very highly with the
actual performance test; in the process, we avoid consuming instructor
and equipment time for test purposes and can deal with more than one
or two students at a time. We are currently working on a demonstration
of this technique in an electronics training course at Lowry AFB.
Previous work on symbolic performance testing has not been very
encouraging. Nevertheless, the potential increase in testing
efficiency makes continued exploration of symbolic performance
testing worthwhile.

Since the advent of criterion referenced measurement, one of the
major tools used by the ATC instructor has been the criterion checklist.
Because accurate measurement requires reliability as a precondition,
we are currently investigating the reliability of this device in two
ATC courses. We hope to be able to suggest operational practices which
would increase the reliability of measurement from use of criterion
checklists.
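
One way such checklist reliability might be quantified is agreement
between two instructors who independently mark the same students. The
sketch below computes Cohen's kappa, a chance-corrected agreement
index. The choice of kappa and the sample marks are assumptions made
for illustration; the index actually used in the study is not stated
here.

```python
def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' checklist marks.

    rater_a, rater_b: equal-length lists of category marks for the same
                      students (for example 1 = go, 0 = no-go).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    # Agreement expected by chance from each rater's marginal proportions.
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used one and the same category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical go/no-go marks by two instructors for ten students.
instructor_1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
instructor_2 = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(instructor_1, instructor_2)
```

Here the instructors agree on 8 of 10 students, but because half of
that agreement would be expected by chance, kappa is noticeably lower
than the raw 80 percent figure.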

Future Research

Requirements validation, referred to in Table 1, is meant to
encompass research to ensure that training objectives flow from job
requirements. We would agree that some theory, concepts, skills, or
abilities should be taught, even though these do not appear to be job
requirements per se. The object here would be to discover better ways
of judging which enabling objectives are prerequisites to job
performance and which are irrelevant. The student himself may be a
fruitful, but often overlooked, source of ideas, and so we are led
full circle back to student critiques as a method for developing this
information.

Returning to the measurement aspect, we intend to pursue
applications of latent trait theory to ATC measurement problems. Latent
trait theory is a relatively new approach to measurement. Popularized
by Lord (1952, 1953a, 1953b), latent trait theory has the potential to
help solve many criterion-related measurement problems. Hambleton
et al. (1977) cite the disadvantages of classical measurement
procedures; among these are sample-specific item parameter estimates,
and the fact that they have no utility in determining how a given
examinee will perform on a particular item or set of items. Latent
trait theory permits us to predict item performance by individual
examinees based upon the underlying trait or characteristic being
measured. It does not rely on the classical "standard error of
measurement," which assumes that examinees of all ability levels have
equal errors of measurement. Latent trait theory also allows us to get
away from the concept of using only group correlation coefficients
between predictor and criterion test scores to determine the utility
of a measurement procedure. The concept of "test information function"
allows us to compare different instruments or procedures in terms of
the relative amount of information which they produce about the
underlying trait.
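
The two ideas just described, predicting an individual's item
performance from the underlying trait and comparing instruments by
their information, can be sketched under one specific assumption: the
two-parameter logistic model, in which each item has a discrimination
a and a difficulty b. The model choice and the item parameters below
are illustrative assumptions, not results from our program.

```python
import math


def p_correct(theta, a, b):
    """Probability of a correct response under the assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def total_information(theta, items):
    """Test information at ability theta: sum of a^2 * P * (1 - P)."""
    total = 0.0
    for a, b in items:
        p = p_correct(theta, a, b)
        total += a * a * p * (1.0 - p)
    return total


def conditional_sem(theta, items):
    """Standard error of the ability estimate at theta, 1 / sqrt(info)."""
    return 1.0 / math.sqrt(total_information(theta, items))


# Two hypothetical two-item tests, given as (a, b) pairs.
easy_test = [(1.0, -1.0), (1.0, -1.5)]
hard_test = [(1.0, 1.0), (1.0, 1.5)]
info_low_ability = (total_information(-1.0, easy_test),
                    total_information(-1.0, hard_test))
```

In this sketch the easy test is the more informative instrument for a
low-ability examinee and the hard test for a high-ability one, and the
standard error varies with ability rather than being a single value
for all examinees.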

A relatively large and growing amount of research has been done
in the use of latent trait parameter estimates for selection and
classification. Aside from the on-going work by Weiss and his
associates at the University of Minnesota, practically no research has
been done in an instructional environment. We plan to look at such
applications in the near future, probably in our FY 79 program.

One of our major concerns is the effect of having a
multi-dimensional instructional situation as opposed to a relatively
uni-dimensional aptitude measurement problem. As current latent trait
models are defined, a uni-dimensional latent trait is assumed. We
must either examine the robustness of existing models to violation of
this assumption or create new, more complex models which can handle
multi-dimensional data. If one of these alternatives proves fruitful,
many of the scaling, sampling, and individual-prediction problems
with conventional criterion predictions may be eased.

Continued work on adaptive testing is a clear future direction.
We would propose to implement those models which survive our present
studies and meet the tests of practicality, ease of use, efficiency,
and accuracy.

Still a third aspect to the criterion problem, not considered so
far, is the question of how to set cutoff scores on test instruments.
What rationale should be used for deciding mastery level? A further
factor is the utility or cost of testing. Is the information gained
from the test worth the cost of administration? If cost were taken
into account, perhaps we would conclude that the test ought not to be
given at all! A decision theory approach (Cronbach & Gleser, 1965),
based on the notion of utility, may be fruitful for investigating these
two additional aspects.

Figure 2 is a graphic representation of the problem one faces in
setting cutting scores on a test. In a roughly normal distribution of
test scores, students tend to fall into three discernible groups:
masters, non-masters, and a fairly large middle group which has test
scores between C1 and C2. One would need to collect more information
about this middle group to render an effective mastery decision. This
may be uneconomical in certain instances. If C2 is chosen as the cutting
score, the shaded area represents errors of classification on the task.
With C2 as the cutting score, false positives are quite small and false
negatives relatively large; the opposite is true if C1 is chosen.
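
The trade-off between the two cutting scores can be made concrete with
a small sketch. It assumes, purely for illustration, that true masters
and non-masters each have normally distributed test scores and that
each kind of error can be assigned a dollar cost; the distributions,
base rate, cutting scores, and costs below are hypothetical values.

```python
from statistics import NormalDist


def classification_errors(cutoff, masters, non_masters, base_rate):
    """Return (false_positive, false_negative) proportions of examinees.

    masters, non_masters: NormalDist score models for each group
                          (hypothetical).
    base_rate: assumed proportion of true masters among examinees.
    """
    # Non-masters at or above the cutoff are passed in error.
    false_pos = (1.0 - base_rate) * (1.0 - non_masters.cdf(cutoff))
    # Masters below the cutoff are failed in error.
    false_neg = base_rate * masters.cdf(cutoff)
    return false_pos, false_neg


def expected_cost(cutoff, masters, non_masters, base_rate,
                  cost_false_pos, cost_false_neg):
    """Expected dollar cost of classification errors per examinee."""
    fp, fn = classification_errors(cutoff, masters, non_masters, base_rate)
    return fp * cost_false_pos + fn * cost_false_neg


masters = NormalDist(mu=80, sigma=8)
non_masters = NormalDist(mu=60, sigma=8)
c1, c2 = 65, 75  # lower and higher cutting scores
errors_c1 = classification_errors(c1, masters, non_masters, 0.5)
errors_c2 = classification_errors(c2, masters, non_masters, 0.5)
```

With these assumed values, the higher cutting score yields few false
positives and many false negatives, and the lower one the reverse.
Which is preferable depends on the relative costs passed to
expected_cost, for example on how expensive it is to certify a
non-master compared with retraining a master.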

A decision-theoretic way of thinking may be helpful in setting the
cutting score. The importance of the decision being made dictates
whether C1 or C2 is the most beneficial place for the cutting score.
If the consequences of task success and failure can be quantified in a
dollar metric, the costs of additional testing could be combined for
various cutting levels, and a more rational decision could be made.
These questions remain to be explored in future research.

Figure 2. Cutoff scores and decision errors.




In summary, it should be emphasized that for the training community
progress on the criterion problem will come when both the measurement
and definition aspects have been addressed. We have shown to what extent
our program has been concerned with these issues. More work needs to be
done on the definition aspect in order to assure ourselves that job
relevant behaviors are being trained in an effective manner.

Much of what has been presented in this paper is clearly applied,
even "action-oriented" research. That is, known techniques are applied
to solve operational problems. We have a strong bias in this direction
and feel that such is a proper orientation for a military R&D
organization. However, some emphasis on theoretical development,
advances in statistical methodology, and innovation in measurement
techniques will be maintained. We must continue to support and
encourage basic research so that new tools will be available to solve
problems yet unstated.

We hope to have learned some lessons in our 8-year existence.
Many of these are not research lessons but guidelines for translating
our research into operational programs. We must constantly be alert
for closer coordination with our users, not only to be responsive to
real needs, but also to help with the problem of personnel turnover and
changing perceptions of needs. In addition, we must provide transition
plans to include support and training where needed so that
improvements may be institutionalized, for institutionalization of our
research must be the overriding goal of our measurement/evaluation
program.

OVERVIEW OF ADVANCED SYSTEMS DIVISION
CRITERION RESEARCH (MAINTENANCE)



John P. Foley, Jr.

Advanced Systems Division

Wright-Patterson Air Force Base, Ohio



Introduction



The Advanced Systems Division (AS) of the Air Force Human Resources
Laboratory (AFHRL) has had two separate and distinct criterion R&D
programs: one concerning pilot performance and the other concerning
maintenance performance. Today I am addressing our maintenance program.

Maintenance of hardware is currently an extremely costly operation
for the Department of Defense (DoD). High maintenance cost is the
primary cause of high systems ownership cost. For some electronic
maintenance specialties, nearly 1 year of broad formal training is
given first enlistment personnel. And maintenance training generally
is long and costly. Even with such lengthy training, the efficiency
of maintenance could be greatly improved. Improved job instructions
and information, as well as increased use of job (task) oriented
training, have great potential for decreasing maintenance training
time and improving the job performance of maintenance tasks.

But, to maximize such potential and to ensure more efficient
maintenance, the criteria for the selection, training, assignment, and
promotion of maintenance men should be the demonstrated ability of
maintenance personnel to perform the tasks of their jobs. To enforce
such criteria, the key job tasks must be identified and the ability to
perform identified tasks must be ascertained. Since the ability to
perform many or most of the identified tasks will not be part of the
normal repertoire of those being selected for jobs, appropriate action
must be taken to develop the ability to perform job tasks. Of course,
these actions are "easier said than done."



The Criterion Problem



If we can produce a measuring device that actually measures the
ability to perform the desired behaviors under all the desired
conditions, we have an ultimate criterion measure. But the fact that we
usually cannot develop such a device forces us to settle for a
secondary criterion measure which is, at best, somewhat different from
the ultimate. As we see it, this difference between the real world and
the simulation of the real world for testing purposes is the criterion
problem.

A common example of such a criterion problem presents itself when
we attempt to measure an individual's ability to drive automobiles.
To measure such ability completely, we would have to devise a test
that would measure his ability to perform all driving tasks of all
automobiles, on all types of roads, in all traffic conditions, under
all types of weather conditions, whether he is being observed or not.
It is obvious that it would be virtually impossible to meet all of
these conditions under practical testing conditions. We, therefore,
settle for a less rigorous test criterion. We assume that he can
drive any automobile adequately if he demonstrates in a performance
test that he can perform most driving tasks in one automobile, in
normal traffic, while being observed.

But many times, it is inconvenient and considered too costly to
administer even such a driver performance test, and an attempt is
made to develop a paper-and-pencil test which will determine that an
individual can drive adequately. But such a test cannot be considered
to be a valid substitute unless a high empirical relationship to the
criterion measure can be demonstrated. In the practical world of test
development, the driver performance test would be considered an
adequate, near ultimate criterion test for validation of such a
paper-and-pencil substitute. Many times, such a paper-and-pencil test
is used without being validated against such a near ultimate criterion
test. The use of such an unvalidated test would be an extremely
dangerous practice, since it is assumed by most users that it measures
an individual's ability to drive, when in fact we are not sure what
it is measuring.
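
The "high empirical relationship" called for here is ordinarily
expressed as a validity coefficient, that is, the correlation between
scores on the substitute test and scores on the near ultimate
criterion test for the same individuals. The sketch below computes a
Pearson correlation; the function name and the small set of scores are
hypothetical and serve only to show the arithmetic.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists of at least 2 scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary within each list")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for four examinees on a paper-and-pencil test
# and on a driver performance test.
paper_scores = [1, 2, 3, 4]
performance_scores = [1, 3, 2, 4]
validity = pearson_r(paper_scores, performance_scores)
```

How high such a coefficient must be before a paper-and-pencil test can
stand in for the performance test is a judgment the sketch does not
settle; it only shows that the claim is checkable once both sets of
scores exist.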

This criterion problem has long plagued measurement theorists and
practitioners, as well as curriculum researchers. The use of job
tasks, and performance examinations based on these tasks, as near
ultimate criteria for evaluation of selection devices was first
emphasized as a result of the work of Army and Navy measurement
psychologists during World War II. In 1946, Jenkins discussed the
problem in light of the experiences of Navy psychologists in an article
in the American Psychologist entitled "Validity for what?"

    Psychologists in general tended to accept the tacit assumption
    that criteria were either given of God or just to be found
    lying about. The novice of 1940, searching through many
    textbooks and much journal literature, would have been led
    to conclude that expediency dictated the choice of criteria
    and that the convenient availability of a criterion was more
    important than its adequacy.



In 1964, the late Rains Wallace presented a paper at the annual
convention of the American Psychological Association (APA) which also
appeared in the American Psychologist (Wallace, 1965a). It indicated
that much of what Jenkins said in 1946 was still true.



    In the 18 years which have followed, we have become wiser and
    sadder about the criterion problem. If we have not
    accomplished a great deal, if we tend to use the expedient
    criterion with the comforting thought that some day we
    will get down to constructing better ones, if we concentrate
    on criteria that are predictable, rather than appropriate,
    we do operate with varying levels of guilt feelings. We
    have not done much about it, but we know we should.

In 1965, Wallace presented another paper in which he addressed the
criterion problem as it applies to electronic maintenance:
    All of this is prelude to my main thesis which is in no sense
    revolutionary, original, or controversial. I state it because
    it is honored in the breach. It is that the nature of our
    proficiency measures determines how we select, classify, train,
    maintain, and assess our human resources. If the measures are
    largely irrelevant to the jobs we want done, we will select the
    wrong men, classify them incorrectly, and train them wrong.
    This is true because these proficiency measures are, or should
    be, the criteria against which we validate our selection and
    classification procedures and evaluate our training content
    and methodology or our supervisory techniques. Thus, if I
    use a test of advanced electronics theory as the proficiency
    measure for electronics maintenance and as the criterion
    against which to evaluate a test for selecting men to go into
    maintenance training, I will end up choosing a selection test
    which rejects men who are not well above average in both
    reading and arithmetic ability. In the process, I might reject
    a great many who are outstanding in their ability to get their
    hands on a piece of machinery and make it work. I might also
    accept a number who (like myself) are so lacking in the simplest
    manipulative ability that their hands could have been cut off
    at the wrists at birth without seriously affecting their outputs.
    So, when I decided what proficiency measures to use, I also
    decided what kind of man was going into training for the job.

    But it doesn't end there. For when I now attack the problem
    of how to train men to perform the tasks involved in the job,
    I must make decisions about what should be taught and what
    methods should be used in teaching it. The only way I have
    of reaching such decisions (except by divination which is,
    admittedly, not a rare procedure) is to measure and compare
    the performance achieved with various curricula and
    methodologies. So, [...] the electronics maintenance course,
    [...] reading about electronics theory and [...] graduates
    read and write electronics theory while their equipment
    deteriorates in hopeless inoperativeness (Wallace,
    1965b, p. 4).

Influenced in part by the above statement, we at the Advanced
Systems Division decided to do something about the criterion problem
as it applied to maintenance. And, although our work was at times
delayed and sidetracked, 12 years later we do have some R&D completed
which we can talk about. However, the grim and vivid picture that
Rains Wallace painted in 1965 is still true for most of the operational
Air Force.




Our approach to the criterion problem has been to study and
analyze both measurement literature and maintenance jobs, and to
develop job task performance tests (JTPT) for key maintenance tasks
which were selected on the basis of these analyses. We developed
these JTPT to be as near to ultimate job criteria as possible, in
keeping with the following suggestion of Frederiksen:

    The objective, presumably, is to get as close as is feasible
    to the ultimate criterion; but as has just been seen, when
    one gets too close to the real-life situation, control of
    the conditions for adequate observation is lost. Observation
    of real-life behavior is ordinarily not a suitable technique
    for measurement. The type of measure that is recommended
    for first consideration in a training evaluation study is
    the type which most closely approximates the real-life
    situation, that which, in this chapter, has been called
    eliciting lifelike behavior. If it is not feasible to wait
    for the behavior to happen in real life, then lifelike
    occasions can be provided for the behavior to occur in a
    test situation (Frederiksen, 1962, p. 334).




Admittedly, an examination made up of tasks removed from their
actual job environment is not an ultimate criterion. Under
actual job situations, the graduate may have to perform these tasks
in cramped quarters; under stresses of time, noise, heat, or cold; or
with the boss interfering. These conditions of stress are
usually not constant variables, but change from day to day and from
hour to hour. The assumption usually has to be made that the individual
can perform a task under conditions of stress, provided he can perform
the same task well under normal conditions. A formal performance
examination has its own set of stresses, which may not be the same as
job stresses, but their presence may tend to offset the lack of job
stresses. Formal job task performance examinations are the closest
usable simulation of maintenance jobs presently available.
They are far better than no performance tests at all.

Review of Performance Measurement (PM) Literature

In regard to the literature reviews and analyses made for PM
(Foley, 1967, 1974), many valuable PM efforts have been reported by
the Army, Navy, and Air Force. However, most of these efforts have
not been systematic efforts having as their prime objective the
improvement of the state of the art of PM. Rather, they have been ad
hoc PM developments in support of job or training research programs.
A notable exception was the work of the Air Force Personnel and
Training Research Center (AFPTRC) Maintenance Laboratory. (Another
more recent systematic Army effort, accomplished by the Human
Resources Research Organization (HumRRO), was not covered in these
reviews; see Vineberg, Taylor, & Caylor, 1970a, 1970b; Vineberg &
Taylor, 1972a, 1972b.) As to civilian R&D, during the initial PM
literature review (Foley, 1967), a serious attempt was made to
identify and include the results of PM R&D from the civilian
vocational education establishment. None was found.

A substantial outcome of the review of other PM efforts was a
consolidation of research results concerning the correlations between
results of PM of various maintenance tasks and paper-and-pencil theory
tests, job knowledge tests, and school marks. As to their value for
measuring ability to perform maintenance tasks, this research evidence
gives a low rating to all of these paper-and-pencil based measures of
school and job success. Table 1 shows correlations that have been
obtained by comparing JTPT to theory tests, and to job knowledge tests.
The latter two are paper-and-pencil tests. Table 1 also includes
correlations of JTPT with school marks. As indicated earlier, school
marks have been heavily weighted with paper-and-pencil test
scores. An examination of this table indicates that the correlations
of JTPT scores with theory test scores are generally somewhat lower
than with job knowledge tests. None of these measures is sufficiently
valid for use as substitutes for JTPT (Foley, 1967, 1974).

The personnel system, which includes formal training, depends
almost exclusively on such paper-and-pencil tests for making initial
selection, for ascertaining effectiveness of training, and for the
promotion of maintenance personnel. The effectiveness of formal
training for the mechanical maintenance specialties is measured mainly
by scores obtained from such paper-and-pencil job knowledge tests,
even though the students in these training programs have received at
least some "hands-on" practice on many mechanical maintenance tasks.
The measures of effectiveness of formal training programs for the
electronic maintenance specialties include scores from paper-and-pencil
job knowledge tests, as well as theory tests. Students in these
electronic maintenance courses receive little if any "hands-on"
practice in their maintenance tasks.



Table 1. Correlations Between Job-Task Performance Tests and Theory
Tests, Job Knowledge Tests, and School Marks

                   Type of Job-Task               Theory     Job Knowl-  School
Researchers        Performance Tests (JTPT)       Tests      edge Tests  Marks

Anderson (1962a)   Test Equipment JTPT                                   .18-.33

Evans and ...      Troubleshooting JTPT           .24 & .36  .12 & .10   .3?
Mackie et al.      Troubleshooting JTPT           .38                    .39
  (1953)

Saupe (1955)       Troubleshooting JTPT                      .55         .55

Brown et al.       Troubleshooting JTPT           .40        .29
  (1959)           Test Equipment JTPT                       .28
                   Alignment JTPT                            .?9
                   Repair Skills JTPT

Williams and       Troubleshooting JTPT
  Whitmore (1959)    (Inexperienced subjects)     .23
                     (Experienced subjects)       .15
                   Adjustment JTPT
                     (Inexperienced subjects)     .02
                     (Experienced subjects)       .21
                   Acquisition Radar JTPT
                     (Inexperienced subjects)     .03        .36
                     (Experienced subjects)       .14        .22
                   Target Tracking Radar JTPT
                     (Inexperienced subjects)     .24        .33
                     (Experienced subjects)       .20        .38
                   Missile Tracking Radar JTPT
                     (Inexperienced subjects)     .09        .15
                     (Experienced subjects)       .19        .32
                   Computer JTPT
                     (Inexperienced subjects)     .08        .24
                     (Experienced subjects)       .06        .14
                   Total JTPT
                     (Inexperienced subjects)     .20
                     (Experienced subjects)

Crowder et al.     Troubleshooting JTPT           .11        .18-.32
  (1954)
The selection tests for both mechanical and electronic maintenance
specialties have been standardized against composite scores from paper-
and-pencil tests. This means that the people selected for the
maintenance specialties have been selected not on their aptitude for
performing the tasks of their maintenance jobs, but on their aptitude
for making high scores on paper-and-pencil theory and job knowledge
tests.

Our specialty knowledge test (SKT) and the promotion fitness
examination (PFE) used for advancement up the maintenance career ladders
also are paper-and-pencil job knowledge tests. At the present time,
throughout his whole career, a maintenance specialist is not required
to demonstrate on formal JTPT that he can efficiently and effectively
perform the tasks of his job.

The Man-Machine Interface for Maintenance

The maintenance R&D supported by AS has emphasized the man-machine
interface. From this point of view, PM for all personnel associated
with machine systems must determine the ability of such personnel to
perform tasks generated by the man-machine interface. Although there
may be some overlap, most of the task functions demanded by a machine
system of its operator personnel are different from those task functions
demanded of its maintenance personnel. Herein lies most of the unique,
distinguishing characteristics of PM for maintenance. As a result,
this section of my paper will be devoted to a discussion of the
complexity of maintenance task functions.

Past Human Factors Emphasis

But before discussing the characteristics of task functions for
maintenance, it might be well to call attention to the fact that human
factors establishments have given much more attention to the operator
interface with machines than to the maintenance personnel interface.
Many actions are taken to maximize effective and efficient performance
of the operator. Work stations are human engineered to maximize the
efficiency and comfort of the human operator. Major training facilities
are provided so that operators can receive a large amount of supervised
practice in performing typical tasks of their job. Graduation from
training is based primarily on demonstrated ability to perform job tasks.
And, periodic checks are made of the operator's ability to perform the
critical tasks of his job. These, of course, are not all of the many
efforts made to maximize the performance of human operators.

Generally, the human factors establishment has given little
attention to the effectiveness and efficiency of the maintenance man's
interface with hardware. The maintenance work of AS, including the
PM, has emphasized this neglected interface, but typically, this
part of our program has received little management visibility or support.



The Structure of the Man-Machine Interface for Maintenance

One of the results of our R&D for maintenance has been the evolution
and articulation of a structure for handling maintenance functions and
their complex relationships in a systematic manner. This structure
includes (1) standard maintenance functions and action verbs, (2) a
working definition of a maintenance task, and (3) schemes for handling
the complexities of maintenance tasks.

Standard Maintenance Functions and Action Verbs

The establishment of standard maintenance functions and action
verbs has been one of the widely accepted results of the Air Force
Systems Command's (AFSC) job performance aids (JPA) effort entitled
"Presentation of Information for Maintenance and Operation" (PIMO).
(Although the PIMO project was managed by the Space and Missile
Systems Organization (SAMSO) of AFSC, AS provided active participation
and technical inputs during the entire project from 1966 through 1969.
AS has incorporated the key findings and outputs of PIMO in its own JPA
efforts.) Early in the PIMO project, it was found that many mainte-
nance action verbs and functions were used by maintenance people, some
with several different meanings. Part of this confusion was caused by
the language used in maintenance technical orders which were written by
different people and produced by many different hardware manufacturers.
As a result, maintenance technicians themselves did not generally use
precise language. A study was made to identify and define these action
verbs. Where two or more verbs were used to indicate a similar action,
the preferred verb was selected based on the expressed preference of
a sample of maintenance men with a wide range of maintenance Air Force
Specialty Codes (AFSCs). The use of the preferred verbs of this list
is now a firm requirement of Air Force technical order specifications,
as well as of recent Army and Navy specifications (see Joyce, Chenzoff,
Mulligan, & Mallory, 1973, pp. 97-142).

A Working Definition of a Maintenance Task

Within this list of action verbs are a number of key action verbs
(functions). A key action verb, with an appropriate specific hardware
unit as its predicate, becomes a task statement which
represents a maintenance task which can be demanded by the
operation of a specific machine subsystem. A list of these functions
is found in AFHRL-TR-73-43(I) (Joyce et al., 1973). This
list includes functions which are found in both mechanical and electronic
jobs. Some apply to only mechanical jobs and some apply to both.

Schemes for the Systematic Consideration of Maintenance Functions and
Tasks

Three schemes have been developed for the systematic consideration
of maintenance functions and tasks, and the key factors that affect them.

Scheme One--A convenient model for categorizing these maintenance
functions with relation to the type of hardware and the level of
maintenance is presented in Figure 1. The common maintenance functions
already mentioned, together with the usage of test equipment and hand
tools, are represented on one axis of the model. Since mechanical and
electronic subsystems usually require a different variety of mainte-
nance actions, they are represented by another axis. (In regard to
this axis, mechanical maintenance could be further divided into two
categories, one represented by hardware such as jet engines, and
another by hardware such as airframes and tank and ship hulls.)




The third axis of the model represents the three levels or categories
of maintenance as defined in the military services. Organizational
maintenance is the first level. It is usually aimed at checking out a
whole machine subsystem and correcting any identified faults as quickly
as possible. Flight line maintenance falls in this category. A system
is checked out. If it does not work, the line replaceable unit (LRU)
or "black box" causing malfunction is identified and replaced. This
major component is then taken to the field shop (intermediate mainte-
nance) where it is again checked out and the faults, authorized for
correction, are corrected. The corrective actions, authorized at the
intermediate level, vary greatly from system to system depending on
the maintenance concept of each system. On some systems, the mainte-
nance man will troubleshoot the "black box" to the piece part level.
In more modern equipment, he will identify a replaceable module made up
of many piece parts. Some modules are thrown away; others sent to the
depot for repair. Any line replaceable units which the field shop is
unable, or not authorized, to repair are sent to the depot for overhaul.

Organizational and intermediate level organizations are manned
primarily by technicians whose average length of service is
rather short (slightly more than 4 years in the Air Force). Depots are
manned largely by civilian personnel with a much higher level of
experience and longer retention time. Using this model, it has been
possible to specify areas of concentration for study.

Since PM requirements for maintenance are so different for the
various blocks indicated in this model, it is extremely important that
PM researchers indicate the precise blocks of their concentration. To
date, AS has concentrated on the shaded electronic portions of this
model (Figure 1). The resultant model battery of 48 JTPT together with
their symbolic substitutes will be described later. In addition, a
battery of 11 JTPT was developed on an ad hoc basis (Shriver & Foley,
1975) for mechanical tasks at the organizational level of maintenance
(see shaded portion of Figure 2). The HumRRO work mentioned previously
(Vineberg et al., 1970a, 1970b; Vineberg & Taylor, 1972a, 1972b) was
concerned with mechanical hardware (tank and truck). The tests
developed concerned this maintenance field and are indicated by
the shaded portions of Figure 3.

Scheme Two--Maintenance functions have limited meaning unless
applied to specific hardware. A task identification matrix (TIM) is an
extremely effective and necessary device for interfacing these mainte-
nance functions with the appropriate hardware units and thus identifying
the maintenance tasks that are generated by a specific machine subsystem
(see Figure 4). The TIM, when properly structured, will reflect the
maintenance level or levels of interest, that is organizational, inter-
mediate, and/or depot. AFHRL-TR-73-43(I) (Joyce et al., 1973, pp. 16-37)
provides detailed directions for developing a TIM.
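
A TIM of this kind can be pictured as a small lookup table. The
following Python sketch is an illustration only and is not the format
specified by Joyce et al.; the hardware unit names, the selection of
functions, and the level entries are all hypothetical placeholders.

```python
# Hypothetical task identification matrix (TIM): rows are hardware units,
# columns are maintenance functions, and each cell holds the set of
# maintenance levels at which that task is performed (absent = no task).
FUNCTIONS = ["checkout", "remove", "install", "align", "troubleshoot"]

TIM = {
    # hardware unit (made-up names) -> {function: levels}
    "receiver-transmitter": {
        "checkout": {"organizational", "intermediate"},
        "remove": {"organizational"},
        "install": {"organizational"},
        "align": {"intermediate"},
        "troubleshoot": {"intermediate", "depot"},
    },
    "antenna": {
        "checkout": {"organizational"},
        "remove": {"organizational"},
        "install": {"organizational"},
    },
}


def tasks_at_level(tim, level):
    """Return task statements (function + hardware unit) for one level."""
    tasks = []
    for unit, row in tim.items():
        for function in FUNCTIONS:
            if level in row.get(function, set()):
                tasks.append(f"{function} {unit}")
    return tasks


if __name__ == "__main__":
    for task in tasks_at_level(TIM, "intermediate"):
        print(task)
```

Each non-empty cell corresponds to one task statement, that is, a
function applied to a specific hardware unit at a stated level.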

Scheme Three--A matter of serious concern when developing and
structuring PM for maintenance tasks is the interaction among the
maintenance tasks for one hardware. A four-level hierarchy of
dependencies can be stated. Figure 5 gives a graphic presentation of
these dependencies among maintenance activities for an electronic
hardware.

The checkout of the AN/APN-147 (Doppler Radar), for example, can
be a task in its own right. But the same checkout activity becomes an
element of other major tasks such as calibrate. The calibration of
doppler radar includes the operation of specific general and special
test equipment, the use of specific hand tools, as well as the check-
out activity. Troubleshooting of an electronic equipment, such as
AN/APN-147, requires the use of general and special test equipment and
it may require remove and install activities and/or adjust, align, and
calibrate activities. Efficient troubleshooting practice usually
requires the use of a cognitive strategy to adequately track the depen-
dent activities (but the cognitive strategy in itself is not trouble-
shooting). Any troubleshooting task should begin and end with an
equipment checkout. Because of these various and varying dependency
relationships, such activities as checkout, remove, install, disassemble,
adjust, align, calibrate, or troubleshoot cannot legitimately be
considered as discrete tasks, even for one electronic system.
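
The dependency relationships described above can be written down as a
small graph. The sketch below is illustrative only: the edges are one
reading of the preceding paragraph (not a transcription of Figure 5),
and the activity names are shorthand chosen for the example. It uses
the standard-library graphlib module (Python 3.9 or later) to list the
activities so that each appears after the activities it depends on.

```python
from graphlib import TopologicalSorter

# activity -> activities it depends on (an assumed reading of the text)
DEPENDS_ON = {
    "checkout": set(),
    "use hand tools": set(),
    "operate test equipment": set(),
    "remove/install": set(),
    "adjust/align/calibrate": {
        "checkout", "use hand tools", "operate test equipment",
    },
    "troubleshoot": {
        "checkout", "operate test equipment",
        "remove/install", "adjust/align/calibrate",
    },
}


def dependency_order(graph):
    """List activities so that prerequisites come before dependents."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(dependency_order(DEPENDS_ON))
```

Because troubleshooting draws on every other activity in this assumed
graph, it always comes last in the resulting order.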

Another confounding factor is the false correspondence that the
same functional verbs create when applied to different electronic hard-
ware. For example, personnel with the Avionic Inertial and Radar
Navigation Systems Specialist, AFSC 328X3, are maintaining at least 50
major electronic subsystems. Many vintages of hardware design are
represented. The checkout activity for each is different (both in
content and difficulty) and in some cases, very different. The lack of
correspondence of alignment, calibration, and troubleshooting tasks
from one specific equipment to another is even greater. An example of
the lack of correspondence from one hardware to another is the wide
difference in content and difficulty of troubleshooting tasks
between two doppler radars. The AN/APN-147, which is used on the C-130
and C-141, has approximately 14,000 shop replaceable units (SRU) whereas
the inertial doppler navigation equipment (IDNE) on the C-5 has only 28.
This lack of correspondence of functions across electronic hardware
makes it difficult to generalize from results of PM from one electronic
hardware to another. One exception is in the area of general test
equipment which may be used in performing maintenance tasks across many
hardware subsystems.

Figure 4. Sample of a Task Identification Matrix (TIM). Cell entries:
- (dash) no maintenance task of this type is performed on this hardware
item; O - task of type performed at organizational level; M - task
performed at intermediate level; and D - task performed at depot level.

Figure 5. Indicating the Dependencies among Maintenance
Functions for an Electronic Hardware (Functions Underlined).
Functions shown: Checkout; Use hand tools, soldering; Remove, Install,
Disassemble, Assemble; Operate General and Special Test Equipment;
Adjust, Align, Calibrate; Troubleshoot.

The examples given are characteristic of many of the electronic
maintenance AFSCs. Similar problems of complexity of
functions and tasks are found in mechanical hardware, but to a lesser
degree.

Development of PM and Symbolic Substitutes for PM

Starting in 1969, AS supported a modest program to provide the Air
Force with the necessary tools for measuring the ability of maintenance
personnel to perform the key tasks of their jobs. The scope of this
work was limited to the maintenance of electronic hardware at the
organizational and intermediate levels (see shaded portion of Figure 1).

This program has two objectives: (1) to develop a model battery of
JTPT together with appropriate scoring schemes for the measurement of
the task performance ability of electronic maintenance personnel (an
effort was to be made for the development of JTPT which could be easily
administered) and (2) using the JTPT of this battery as criteria, to
develop and try out a series of paper-and-pencil symbolic substitute
tests that would hopefully have high empirical validity.

Criterion Referenced Job Task Performance Tests

A model battery of 48 criterion referenced JTPT and a test
administrator's handbook were developed for measuring ability to perform
electronic maintenance tasks. Copies of the actual instructions for
test subjects together with the test administrator's handbook are
available from the Defense Documentation Center (DDC) as AFHRL-TR-74-57(II)
Part II (Shriver, Hayes, & Hufhand, 1975). The test administrator's
handbook was developed with step-by-step detailed instructions so that
an individual with a minimum of electronic maintenance experience can
administer the tests.

This battery includes separate tests for the following classes of
job activities: (1) equipment checkout, (2) alignment/calibration,
(3) removal/replacement, (4) soldering, (5) use of general and special
test equipment, and (6) troubleshooting. The Doppler Radar AN/APN-147
and its Computer AN/ASN-35 were selected as a typical electronic system.
This system was used as the test-bed for this model battery. The
soldering and general test equipment JTPT are applicable to all electronic
technicians. The other tests of the battery apply to technicians con-
cerned with this specific doppler radar system. A detailed description
of the development and tryout of these JTPT is given in AFHRL-TR-74-57(II)
Part I (Shriver & Foley, 1974a). Each class of activity for which JTPT
was developed contains its individual mix of behaviors, but it is not
mutually exclusive. As indicated in Figure 5 and Table 1, a four-level
hierarchy of dependencies exists among them.



After considering product, process, and time as to their appropri-
ateness for scoring the results for each activity, it was decided that
a test subject had not reached criterion until he had produced a complete,
satisfactory product. This was a go, no-go criterion.
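
The go, no-go rule can be stated compactly. The following Python sketch
is an illustration under stated assumptions, not the scoring procedure
of the test administrator's handbook: it assumes that a problem has a
fixed list of expected products and that the administrator records
which of them were judged satisfactory. The product names are made up.

```python
def problem_passed(expected_products, satisfactory_products):
    """Go/no-go: credit only if every expected product is satisfactory."""
    return set(expected_products) <= set(satisfactory_products)


# Hypothetical problem with three expected products.
EXPECTED = ["joint 1", "joint 2", "joint 3"]

if __name__ == "__main__":
    # Two of three products satisfactory: no-go.
    print(problem_passed(EXPECTED, ["joint 1", "joint 2"]))
    # All three products satisfactory: go.
    print(problem_passed(EXPECTED, ["joint 1", "joint 2", "joint 3"]))
```

Under this rule there is no partial credit; process and time do not
enter the score.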

Table 2 summarizes the number of tests, problems, and scorable
products by class developed for the AN/APN-147 and AN/ASN-35. The
simple addition of numbers shown in Table 2 indicates that there are
48 tests, 81 problems, and 133 scorable products. But these numbers
tell us nothing in terms of the content of the tests. To say that one
test subject accomplished 100 scorable products while another accomplished
90 tells us nothing about the job readiness of these individuals or
that one is better than the other. The varieties of scorable products
are so diverse that any combination of them, without regard to what they
represent, is meaningless. The only meaningful presentation of such
information must be in terms of a profile designed to attach meaning to
such information. Such a profile is shown in Figure 6.

Table 2. Tests, Problems, and Scorable Products

                                                               Scorable
Class                                  Code  Tests  Problems   Products

1. Checkout                            CO      2       2          2
2. Physical Skills Tasks (soldering)   PT      2       5         17
3. Remove and Replace                  RR     10      10         20
4. Test Equipment                      SE      7      37         67
5. Adjustment                          AD      6       6          6
6. Alignment                           AL     10      10         10
7. Troubleshooting                     TS     11      11         11

Total                                   7     48      81        133
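
The arithmetic of Table 2, and the reason a raw total says so little,
can be checked with a short Python sketch. The class figures are those
listed in Table 2; the two subjects are invented for illustration and
do not come from the tryout data.

```python
# (class, code, tests, problems, scorable products) as listed in Table 2
TABLE_2 = [
    ("Checkout", "CO", 2, 2, 2),
    ("Physical Skills Tasks (soldering)", "PT", 2, 5, 17),
    ("Remove and Replace", "RR", 10, 10, 20),
    ("Test Equipment", "SE", 7, 37, 67),
    ("Adjustment", "AD", 6, 6, 6),
    ("Alignment", "AL", 10, 10, 10),
    ("Troubleshooting", "TS", 11, 11, 11),
]


def column_totals(rows):
    """Sum the tests, problems, and scorable-products columns."""
    return tuple(sum(row[i] for row in rows) for i in (2, 3, 4))


# Two made-up subjects with the same raw total of scorable products
# but very different profiles across the seven classes.
SUBJECT_A = {"CO": 2, "PT": 17, "RR": 20, "SE": 51, "AD": 0, "AL": 0, "TS": 0}
SUBJECT_B = {"CO": 2, "PT": 10, "RR": 20, "SE": 31, "AD": 6, "AL": 10, "TS": 11}

if __name__ == "__main__":
    print(column_totals(TABLE_2))
    print(sum(SUBJECT_A.values()), sum(SUBJECT_B.values()))
```

Both invented subjects total 90 scorable products, yet only one of them
has produced any adjustment, alignment, or troubleshooting product,
which is the distinction a profile by class preserves and a single sum
hides.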



This profile is not presented as the final solution to the profile
problem for JTPT for electronic maintenance. It does contain most of
the important information regarding a test subject's job task abilities
as measured by the test battery, indicating the subject's strengths and
weaknesses.

An examination of the profile (Figure 6) indicates that most of
the tests in this battery contain only one problem. For example, there
are two checkout tests having one problem each, and there are 11
troubleshooting tests having one problem each. There are two soldering
tests; one has two problems and the other has three. The voltohmmeter
(VOM) test has 20 problems.



Figure 6. A profile for displaying the results obtained by an
individual subject from a battery of Job Task Performance Tests
concerning an Electronic System -- the AN/APN-147 and the AN/ASN-35.
This profile represents an individual who has successfully completed
most of the battery.

The subject receives no "credit" for a problem unless he obtains
all of the expected products. No attempt is made to combine these
scores in terms of meaningless numbers.





The hierarchy of dependencies discussed previously (Figure 5) has
implications for the order in which tests are administered, as well as
for diagnostics. For example, since troubleshooting includes the use
of test equipment and other activities in the hierarchy, logic would
dictate that in most training situations the administration of the tests
for the sub-activities would precede the troubleshooting tests and that
a test subject would not be permitted to take the troubleshooting tests
until he had passed these other subtests. Under some circumstances, one
may wish to reverse the process. A subject who successfully completes
selected troubleshooting or alignment tests can be assumed to be pro-
ficient in his use of test equipment and checkout procedures. These
dependencies are displayed on the left-hand side of the profile
(Figure 6).
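
Both uses of the hierarchy, gating and reverse inference, can be
sketched in a few lines of Python. The prerequisite sets below are an
assumption drawn from the wording of the paragraph above, not from the
handbook, and the class names are shorthand for the example.

```python
# Assumed prerequisite subtests for each gated test class (illustrative).
PREREQUISITES = {
    "troubleshooting": {"checkout", "test equipment"},
    "alignment": {"checkout", "test equipment"},
}


def may_attempt(test_class, passed):
    """A subject may attempt a test only after passing its prerequisites."""
    return PREREQUISITES.get(test_class, set()) <= set(passed)


def inferred_proficiencies(passed):
    """Reverse use: passing a dependent test implies its prerequisites."""
    inferred = set(passed)
    for test_class in passed:
        inferred |= PREREQUISITES.get(test_class, set())
    return inferred


if __name__ == "__main__":
    print(may_attempt("troubleshooting", {"checkout"}))
    print(sorted(inferred_proficiencies({"troubleshooting"})))
```

The first function serves the training situation, where subtests come
first; the second serves the situation in which only selected
troubleshooting or alignment tests are given.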

Due to the unavailability of a sufficient number of experienced
test subjects at the time of the tryout of the JTPT battery, the tryout
was not as extensive as planned. The limited tryout did indicate that
the tests as developed are administratively feasible. Their continued
use, no doubt, would result in further modifications and improvements.

Development of Symbolic Substitutes 

There is no doubt that a battery of JTPT would require more training
and on-the-job time of the test subjects, more equipment, and specially
trained test administrators. Therefore, the availability of empirically
valid symbolic substitute tests would be highly desirable. Even though
previous attempts to develop such tests as the Tab test (Crowder,
Morrison, & Demaree, 1954) had failed, it was our opinion that much more
work could be done to improve symbolic maintenance tests as substitutes
for JTPT. It was hypothesized that higher correlations possibly could be
obtained by a different approach to the development of symbolic tests.
A study of the Tab Tests (Crowder et al., 1954, see Table 1) indicated
that the JTPT used as the criterion measures contained many distractions
and interruptions to the subject's troubleshooting strategy (cognitive
process), such as using test equipment to obtain test point information.
In addition to such interruptions to the cognitive process, the subject
can obtain faulty test point information by the improper use of his test
equipment. In the symbolic substitute Tab Tests, all of these potential
pitfalls of the actual task were avoided. The subject was given a
printed test point readout. It was hypothesized that the injection of
job equivalent pitfalls into symbolic substitutes possibly would increase
their empirical validity.



Based on these hypotheses, a battery of symbolic tests was developed
under contract with the Matrix Research Company of Falls Church, Virginia.
A companion graphic symbolic test was developed for each of the job
activities for which a criterion referenced JTPT had previously been
developed. Based on two limited validations, all of the graphic
symbolic tests, with the exception of the symbolic test for soldering,
indicated sufficient promise to justify further consideration and
refinement. Table 3 indicates the correlations obtained from these
validations. Due to a shortage of available subjects, the number of
pairs of subjects was extremely small. All of these promising graphic
symbolic tests, therefore, must be given more extensive validations
using larger numbers of experienced subjects.

The validation of any such symbolic test requires the administra-
tion of a companion JTPT as a validation criterion. As a result,
validation is an expensive process in terms of equipment and experienced
manpower. The troubleshooting symbolic tests require the most extensive
refinement. Several suggestions are made for improving their empirical
validity. A complete description of these symbolic test efforts can be
found in AFHRL-TR-74-57(III) (Shriver & Foley, 1974b). An attempt, also,
was made to develop video symbolic substitute tests, but this effort
produced no promising results (Shriver & Hufhand, 1974).
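
The empirical validity discussed here is ordinarily expressed as a
correlation between scores on the symbolic test and scores on the
companion JTPT for the same subjects. The Python sketch below computes
a Pearson product-moment coefficient; the scores are invented for
illustration, and the choice of the Pearson coefficient is an
assumption, since the text does not name the statistic used.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples of at least 2 pairs")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Made-up scores for six subjects: problems passed on a symbolic test
# and on its companion JTPT.
SYMBOLIC = [3, 5, 6, 8, 9, 11]
JTPT = [2, 4, 7, 7, 10, 11]

if __name__ == "__main__":
    print(round(pearson_r(SYMBOLIC, JTPT), 2))
```

With only a handful of pairs, as in the limited validations described
above, a coefficient of this kind is unstable and should be read with
caution.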

Even if graphic symbolic substitutes of high empirical validity
can be produced, the use of symbolic substitutes will never, in my
opinion, dispense with the requirement for the liberal administration
of actual JTPT to maintenance personnel. We can never include all
aspects of the actual performance of a task in a paper-and-pencil
symbolic representation of that task, but our work indicates that we
can come much closer than has been done in the past.

The Sampling Problem 

Timewise, it would be impossible to administer a JTPT to a mainte-
nance man for every possible task that his hardware system might produce.
This world of tasks and people must be sampled. The model battery
described previously provides a sampling procedure based on major task
functions such as checkout, align, adjust, troubleshoot, etc. But even
this sampling across possible tasks resulted in 48 tests and 133 scorable
products (Table 2). It would be impractical to give any one test subject
all these 48 tests at any one time. Systematic sampling schemes must
be developed across tests.

The purpose for which JTPT results are to be used should be consid-
ered when developing sampling schemes. Such purposes could include
ascertaining (1) the job task proficiency of an individual, (2) the
job effectiveness of a training program, and (3) the proficiency of a
maintenance unit. Each of these purposes would require a different mix
or mixes of tests and people. Some suggestions for such samplings can
be found in AFHRL-TR-74-57(II) Part I (Shriver & Foley, 1974a). But it
should be remembered that these are suggestions that must still be field
tested.
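
One simple way to sample across tests is to draw a fixed number of
tests from each class. The Python sketch below shows such a stratified
draw; it is a hypothetical scheme offered for illustration, not one of
the suggestions in Shriver and Foley (1974a), and the per-class quota
and the seed are arbitrary. The test counts are those of Table 2.

```python
import random

# Number of tests per class code, as listed in Table 2.
TESTS_PER_CLASS = {"CO": 2, "PT": 2, "RR": 10, "SE": 7, "AD": 6, "AL": 10, "TS": 11}


def sample_tests(tests_per_class, per_class, seed=0):
    """Draw up to `per_class` test numbers from each class (stratified)."""
    rng = random.Random(seed)  # seeded so a draw can be repeated
    sample = {}
    for code, count in tests_per_class.items():
        k = min(per_class, count)
        sample[code] = sorted(rng.sample(range(1, count + 1), k))
    return sample


if __name__ == "__main__":
    for code, picks in sample_tests(TESTS_PER_CLASS, per_class=3).items():
        print(code, picks)
```

A different purpose (individual, training program, or unit) would call
for a different quota per class and a different sampling of people.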

In the case of determining unit proficiency, some JTPT can be
administered by on-line observation of tasks which are often repeated
such as checkout. There will always be a requirement for off-line PM
concerning critical, but seldom performed tasks. Whether the JTPT is
performed on-line or off-line, the test administrator must use the same
objective scoring procedures, the criteria of success being an
acceptable product.

Consolidated Data Base to Support PM

In keeping with its man-machine interface orientation, AFHRL/AS is
demonstrating the technical feasibility of integrating five human
resources related technologies and applying them during weapon system
development. This is being accomplished under Project 1959, "Advanced
System for the Human Resources Support of Weapon System Development."

The five technologies are:

Human Resources in Design Tradeoffs
Maintenance Manpower Modeling
Job Performance Aids
Instructional System Design
System Ownership Costing

One objective of this program is to determine the data input
requirements for and prepare specifications for a consolidated maintenance
task identification and analysis data base which will support the
integrated application of these five technologies in a weapon system
development program. We feel that such a consolidated data base will
contain most, if not all, of the information which would be required to
develop good JTPT provided the tests are developed in keeping with the
technology described in this paper. If such a data base is demonstrated
to be technically feasible and if it is routinely made a requirement in
weapon system development contracts, it will provide considerable
assistance in developing maintenance performance tests for new weapon
systems.



Institutionalization of New Technologies



Getting newly developed technologies institutionalized
is a perennial problem, especially when a technology requires fundamental
changes in long existing programs, procedures, and attitudes of entrenched
establishments. AS has been involved in the implementation of several well
developed and documented technologies, such as job performance aids and
instructional system design (ISD) including programmed instruction and
job (task) oriented training. These experiences have indicated that it
is extremely difficult to maintain the integrity of a technology during
its so called implementation. Operational organizations invariably
attempt to implement a much "watered down" version of the technology
and consequently obtain much "watered down" results. In some cases,
only cosmetic changes to existing programs are reported as implementa-
tions. Currently, it requires many years of persistent effort on the
part of the research community to get a technology properly institution-
alized.

A mechanism must be developed for the timely institutionalization
of each new technology which will ensure its integrity. A mechanism
for the orderly implementation of technologies, similar to that used
for new weapons systems, is recommended. Such a mechanism must make
efficient and effective use of the "know-how" of the developers of the
technology and make them responsible and accountable for its
implementation. A new technology should not be turned over to a using
command for its operation until it is in place, "debugged," and
operational, just as a new weapons system is not turned over to an
operational command until it has been "debugged" and proven to be ready
for operational use.

Proposed PM R&D Efforts for Maintenance

Excessive maintenance costs are never going to be reduced as long
as we don't have JTPT and/or empirically valid symbolic substitutes to
ascertain how efficiently maintenance men perform the tasks of their
jobs. In my opinion, the lack of such measures of maintenance
performance is a most serious deficiency in DoD. As such, R&D in this
area should have an extremely high priority.

Areas for R&D Concentration

For a long-range R&D effort, five general areas of concentration
are recommended; namely, JTPT and matching symbolic substitute tests
for electronic maintenance, JTPT and matching symbolic substitute tests
for mechanical maintenance, and aptitude tests based on PM. The
development and field tryout of a JTPT must precede the development of
its symbolic substitute. The work on JTPT batteries for both electronic
and mechanical maintenance should be started as soon as possible. The
work on aptitude tests should not be started until JTPT batteries and
the symbolic substitute tests have been completely field tested. More
information concerning these areas of concentration follows:

1. Refinement of Model JTPT Battery (Electronic Maintenance) -- The
already available model JTPT Battery (Shriver, Hayes, & Hufhand, 1975)
should be given a large scale field tryout. (Since the AB328X4 Avionics
Inertial and Radar Navigation Systems Specialist Course, which includes
the AN/APN-147 and the AN/ASN-35, does not emphasize the mastery of job
tasks, the equipment specific tests of this battery cannot be used in
the formal course.) One thrust of this effort should be to further
refine the battery including its administrative procedures. A second
thrust should be the development of sampling strategies which would be
appropriate for determining the effectiveness of training programs and
both individual and unit proficiency as discussed earlier under PM
problems. This effort would require approximately 2 professional
man-years plus the use of maintenance specialists as test administrators
from the appropriate maintenance specialties. If it is necessary
to select a system other than the AN/APN-147 AN/ASN-35 combination,
this work would require approximately 4 professional man-years.

2. Refinement of Symbolic Substitutes (Electronic Maintenance) -- As
previously indicated, a number of symbolic substitutes for JTPT were
developed and given a limited tryout. Table 3 indicates that some of
the symbolic tests show promising empirical validity. These promising
symbolic tests must be more thoroughly refined and validated. In
addition, further exploratory development is required for symbolic
substitute tests for troubleshooting tasks in keeping with
recommendations made in AFHRL-TR-74-57(III) (Shriver & Foley, 1974b).
This effort would require between 3 and 4 professional man-years plus the
use of maintenance specialists as test administrators and test subjects
from the appropriate maintenance specialties.



Table 3. Numbers of Pairs Used, X², and Correlations Obtained
During Two Small Validations of Symbolic Tests

Test Areas                          N Pairs      X²     Correlations

Novice Subjects (Altus)
  Checkout                              4       4.00     1.00
  Remove & Replace                     14       2.57      .43
  Soldering Tests                       4       0         0
  General Test Equip                    6       2.67      .67
  Special Test Equip                    6        .67      .33
  Alignment/Adjustment                 19       6.37      .58
  Troubleshooting                       9       1.00     -.33a

Experienced Subjects (TAC)
  Overall Troubleshooting              30       6.53      .47    .68
  Chassis (Black box) Isolation        30      16.33      .73    .81
  Stage Isolation                      30       3.33      .33    .46
  Piece/Part Isolation                 15        .07      .07    .16

a This negative correlation was probably due to a number of
deficiencies such as (1) deficiencies in the Fully Proceduralized Job
Performance Aids provided the subjects, (2) deficiencies in the
sequencing of the troubleshooting JTPT in relation to the subtests
in the JTPT battery, (3) maintenance difficulties with the AN/APN-147
AN/ASN-35 system, and (4) difficulties with the content and
administration of test equipment pictorials provided in the original
troubleshooting symbolic tests.
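
The correlation column of Table 3 can be cross-checked against the X²
column. The sketch below rests on an assumption that the surviving table
headings do not confirm: that X² is a chi-square computed on a 2 x 2
pass/fail table and that the first correlation in each row is a phi
coefficient, whose magnitude is sqrt(X²/N). The figures are read from
Table 3; the function names are illustrative only.

```python
import math

# (test area, N pairs, X², first correlation) as read from Table 3.
# 2.57 (Remove & Replace) and N = 30 (Stage Isolation) are readings of
# partly illegible entries in the scanned table.
TABLE_3 = [
    ("Checkout", 4, 4.00, 1.00),
    ("Remove & Replace", 14, 2.57, 0.43),
    ("Soldering Tests", 4, 0.0, 0.0),
    ("General Test Equip", 6, 2.67, 0.67),
    ("Special Test Equip", 6, 0.67, 0.33),
    ("Alignment/Adjustment", 19, 6.37, 0.58),
    ("Troubleshooting", 9, 1.00, -0.33),
    ("Overall Troubleshooting", 30, 6.53, 0.47),
    ("Chassis (Black box) Isolation", 30, 16.33, 0.73),
    ("Stage Isolation", 30, 3.33, 0.33),
    ("Piece/Part Isolation", 15, 0.07, 0.07),
]


def phi_from_chi_square(chi_square, n_pairs):
    """Magnitude of phi implied by a 2 x 2 chi-square: sqrt(X² / N)."""
    return math.sqrt(chi_square / n_pairs)


def largest_discrepancy(rows):
    """Largest gap between the implied |phi| and the tabled correlation."""
    return max(
        abs(phi_from_chi_square(x2, n) - abs(r)) for _, n, x2, r in rows
    )


for area, n, x2, r in TABLE_3:
    print(f"{area:32s} implied |phi| = {phi_from_chi_square(x2, n):.2f}"
          f"   tabled r = {r:+.2f}")
```

Under these assumptions every row agrees with the tabled value to within
about 0.01, which is consistent with rounding. The sign of the
Troubleshooting correlation cannot be recovered from X² alone.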
3. Development of Model JTPT Battery (Mechanical Maintenance) -- A
model JTPT battery similar to the model battery for electronic maintenance
described previously should be developed for a typical mechanical
subsystem such as a jet engine or tank engine covering both the
organizational and intermediate levels of maintenance. This model should
be thoroughly field tested. Sampling strategies as indicated for the
electronic battery should also be developed. This effort will require
approximately 4 professional man-years plus the use of maintenance men
from the appropriate maintenance specialties as test administrators
and test subjects.

4. Development of Symbolic Substitutes (Mechanical Maintenance) --
An attempt should be made to develop symbolic substitute tests with high
empirical validity after the model JTPT battery is available for
mechanical maintenance. The same contractor should develop these
symbolics as developed the JTPT battery. A very rough estimate for
accomplishing this symbolic effort would be 4 professional man-years.

5. Job Aptitude Test Research Based on Results on JTPT -- R&D plans
should be made to utilize the results of JTPT and symbolic substitute
tests for standardizing military aptitude indices obtained from the
Armed Services Vocational Aptitude Battery (ASVAB). As a first step,
the military aptitude scores of all test subjects used for the tryouts
in the proposed JTPT R&D should be recorded. In addition, such aptitude
scores should be obtained during any school or field administration
of JTPT or symbolic substitutes. When sufficient data are obtained,
the degree of relationship between JTPT results and various aptitude
indices should be obtained. Later, when a sufficient number of JTPT
are used in the field, a formal R&D project should be initiated to
modify the ASVAB to directly reflect job success, as measured by JTPT.
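
The "degree of relationship" called for in item 5 could be computed in
several ways; the paper does not name a statistic. As one hedged
illustration, the sketch below assumes that a JTPT is scored go/no-go
(an acceptable product or not) and relates that outcome to an aptitude
index with a point-biserial correlation. The scores are invented for
illustration and are not data from any tryout.

```python
import math


def point_biserial(passed, scores):
    """Point-biserial r between a pass/fail outcome and a numeric score."""
    if len(passed) != len(scores):
        raise ValueError("passed and scores must have equal length")
    n = len(scores)
    pass_scores = [s for p, s in zip(passed, scores) if p]
    fail_scores = [s for p, s in zip(passed, scores) if not p]
    if not pass_scores or not fail_scores:
        raise ValueError("need at least one pass and one fail")
    mean_all = sum(scores) / n
    # Population standard deviation of all aptitude scores.
    sd = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    if sd == 0:
        raise ValueError("scores have zero variance")
    share_passed = len(pass_scores) / n
    mean_pass = sum(pass_scores) / len(pass_scores)
    mean_fail = sum(fail_scores) / len(fail_scores)
    return ((mean_pass - mean_fail) / sd
            * math.sqrt(share_passed * (1 - share_passed)))


# Hypothetical data: JTPT go/no-go results and aptitude index scores.
jtpt_passed = [True, True, True, False, False, True, False, False]
aptitude = [70, 65, 80, 50, 55, 75, 60, 45]
r_pb = point_biserial(jtpt_passed, aptitude)
print(f"point-biserial r = {r_pb:.2f}")
```

With samples as small as those in the tryouts described here, any such
coefficient would be very unstable, which is one reason the text asks
that aptitude scores be collected at every JTPT administration.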

R&D Strategy

Probably the most cost-effective approach for PM for both
electronic and mechanical maintenance would be to concentrate on the
development and refinement of JTPT on use of key test equipments
prior to proceeding with the other task functions of the proposed
model test batteries. As indicated in Figure 5, the use of general
test equipment is a prerequisite to maintenance task functions such
as alignment, calibration, and troubleshooting. In addition, general
test equipments usually have wide usage in such task functions across
many hardware systems, and there is a substantial amount of data
which indicates that many maintenance men are weak in their
test-equipment ability. So, a general improvement in ability to use test
equipment is an important and necessary factor for the general
improvement of several maintenance task functions. I would strongly
recommend, therefore, the early concentration for the proposed model
test batteries in this area. Each PM development for a test equipment
should be accompanied by the development of a programmed training
package with sufficient practice frames for teaching the mastery of all
its functions. Basic models of such training packages for 12 general test
equipments are now available (see Scott & Joyce, 1975a through 1975l).
However, more practice frames should be included in these programs.

Closing Statement 

Maintenance of hardware is currently an extremely costly operation
for the DoD. High maintenance cost is the primary cause of high
systems ownership cost. For some electronic maintenance specialties,
nearly 1 year of broad formal training is given first enlistment
personnel. And maintenance training generally is long and costly.
Even with such lengthy training, the efficiency of maintenance could
be greatly improved. Improved job instructions and information as
well as increased use of job (task) oriented training have great
potential for decreasing maintenance training time and improving the
job performance of maintenance tasks. But to realize such potential,
the criteria for the personnel system (selection, training, assignment,
and promotion) for maintenance personnel must be shifted to the
demonstrated ability to perform the tasks of their jobs. (The current
criteria emphasize the ability to obtain high scores on paper-and-pencil
theory and job knowledge tests.)

In this paper, I have discussed what I think are the important
aspects of the criterion problem as it applies to the measurement of
ability to perform maintenance tasks in training and on-the-job. Our
objective in its solution is to get as close to the real job as
possible. When "on-line" tasks occur often enough, their structured
observation may be appropriate. But when such observations are not
appropriate or when tasks occur infrequently, we propose to have the
tasks performed "off-line" in a job-like environment. Our approach to
the development of such measures was started with an analysis of the
structure of maintenance at the man/hardware interface. Based on the
results of this analysis, we developed a model test battery of JTPT
for electronic maintenance. Using this model as the criterion, we
also developed batteries of graphic and video symbolic substitute
tests. Several of the graphic symbolics have indicated respectable
empirical validities but require more refinement and tryout. Our
attempts to develop video symbolics were unsuccessful.

I have recommended a research program based on what we have already
accomplished. This includes the development of a model battery of
JTPT together with symbolic substitutes for maintenance tasks generated
by a typical mechanical hardware. I have also discussed briefly the
perennial problems of getting new technologies such as JTPT implemented.
There is definitely a requirement for a structured mechanism which will
guarantee the orderly institutionalization of such technologies as well
as their integrity during the implementation process.

FOOTNOTES

Extraneous remarks by Dr. Foley

1. I want to say something here. I said, "for reducing training time."
   I want to make it clear that I didn't say "reducing training cost,"
   because I've been accused of that. Your training costs, when you
   get into job-oriented training, go up, or at least stay the same;
   your training costs per course are probably about the same. The
   only thing is they're more costly per week, but by reducing training
   time you do reduce cost as time in the field, for the more
   time you have a man in the field in his first enlistment, the less
   often you have to replace him.

2. Now, we don't have quite that bad a situation, but we cover up
   that situation in the field of maintenance by gobbling up a lot
   of spare parts, and that's been costing us all kinds of money.
   Anytime we can get our hands on spare parts that have been turned
   in, we find that a great many of them are still good but they are
   destroyed because people have what we call "shot-gunned" and found
   a faulty part by removing and replacing a large number of good
   parts.



CRITERION PROBLEMS 

Cecil J. Mullins and Forrest R. Ratliff 
Personnel Research Division 
Air Force Human Resources Laboratory 
Brooks Air Force Base, Texas



When we first began struggling in this wonderfully complex area of
criterion analysis and development, we were almost overwhelmed by the
assortment of special and seemingly divergent problems associated with
criterion variables. These were problems that seemed to be unrelated
to predictor research, and even unrelated to each other. For instance,
how ultimate should a criterion be? Are we trying to select people who
will do well in training, or those who will perform satisfactorily on
their first job, or those who will get through their first hitch, or
those . . .? Previous work (Ghiselli & Haire, 1960; Prien, 1966) shows
rather clearly that those subjects who are high on some proximal
standard are not necessarily high on any of the more distal ones.

Also, what is the best way to collect criterion information?
Ratings are cheap and they have a certain ring of truth to the rater,
but we know that ratings rarely work well, particularly in the
operational situation. Assessment centers and job-sample data are far too
expensive for routine evaluation of subjects, and there are certain
conceptual difficulties even with them. How does one collect performance
data in one situation in such a way that the scores issuing from the
exercise are comparable with scores on other people doing essentially
the same work but in a different condition, with a different supervisor,
and a different social climate? How does one even demonstrate that a
particular criterion variable is good or bad? Somehow, it jars to talk
about "validating" a criterion.

All in all, the most serious difficulty we had was the lack of a
philosophy or orientation. We needed some way of organizing our
approach, some framework which might systematize our thought and our
effort. We have come to a way of thinking about the problem which,
at least for us, has proved somewhat helpful.

Let us consider what we mean by the word "criterion." Of course,
there is the purely statistical meaning of the term, which means simply
a target variable which we are trying to reproduce by appropriate
mathematical manipulation of other variables. Statistically, the
criterion could be any variable, and the predictor could be any other
variable. But I am referring to the conceptual meaning of "criterion,"
as distinct from the word "predictor." Let us examine some of the
faulty ideas I have held for several years about criterion and predictor
variables. I don't know if anyone here has ever held these ideas, but
I do find them rather widespread. We are not talking here about formal
definitions, but only about conventional wisdom.

Example 1. Predictors are aptitude-type variables and criteria are
achievement measures revealed by some kind of performance. I grew up
with this idea, and I have since found it to be a fairly common
misconception. Actually, practically all psychometric variables are
achievement measures. We are not by any means the first to notice this
(Thorndike, 1926; Estes, 1974). Tests of verbal aptitude, for example,
are usually tests of a subject's current achieved ability to perform
with words. All aptitude measures that I can think of are really tests
of achievement, just like criterion tests. On the other hand, it is
generally accepted that the best predictor of future achievement is
past achievement. Upon examination, then, this distinction between
criterion and predictor disappears.

Example 2. Predictor variables usually represent something "basic,"
perhaps even genetic, while criterion measures represent some sort of
ultimate achievement acquired by the subject through training or
experience. This distinction may be partially true, in that development
of characteristics continues from birth to death. But we think it is
not true in the sense in which it is frequently understood. To use
verbal aptitude as an example again, there is no substantial evidence
for the existence of verbal aptitude as a basic dimension of human
ability except that it appears in one particular kind of factor analysis,
and even then only if the data are collected on subjects older than a
certain age. We think it likely that there are basic aptitudinal
underlayments, probably genetic, but that these are far more simple and
fundamental than the Thurstonian aptitudes. There are probably some
very raw individual differences present at birth, similar to Horn's
anlage functions (1968) or Cattell's fluid intelligence (19a1).

To let Horn speak for himself:

     (The Anlage function) represents very elementary
     capacities in perception, retention and expression,
     as these govern intellectual performance. For
     example, span of apprehension -- the number of
     distinct elements which a person can maintain in
     immediate awareness -- is an elementary capacity and
     yet one which determines, in part, the complexity
     with which one can successfully cope in an
     intellectual task. It would seem that such
     capacities are not much affected by learning --
     anlage functioning is closely associated with
     neural-physiological structure and process --
     but that such functions operate to some extent [...]

Exactly what the anlage functions are, or even how many of them
there are, is still a matter to be determined by research. Whatever
they may be, they are seen as immutable individual differences,
stable and constant throughout the life [...] see later, the anlage
functions can [...] depth of learned material, so that [...]
characteristics is very difficult, but the [...] same quantities as they
existed at [...] other measurable conditions have occurred and have
interacted that something as advanced as verbal (or numerical or
spatial) aptitude develops to a measurable degree. Thus, it is
entirely logical that in some situations a test of verbal aptitude
be used as a criterion measure to be predicted by the more basic
functions. Similarly, later developments (say, performance in Psychology
201) might with equal logic constitute a criterion to be predicted by a
verbal aptitude test, and some other behavior (say, progress as a
psychologist) may be predicted by grades in Psychology 201. In sum,
then, there is nothing ultimate about any "criterion," and nothing basic
about any "predictor" with the possible exception of those unknown
anlage functions we just mentioned.

Example 3. Predictor variables are simple, factorially pure
measures and criteria are complex. Since development normally proceeds
from more simple to more complex, and since criterion measures are
usually taken later than predictor measures, this is probably true in
a general sense. However, there is nothing absolute about this
principle, either. For example, the last time I looked, the best
single predictor of college performance (a criterion) was high school
performance (a predictor). There are other, much purer, predictor
measures, but they don't ordinarily do as good a job as the
complex variable of high school grades.

Example 4. Predictor data are collected at an earlier time than
criterion data. So far as we can tell, this is the only general
statement one can accurately make about the distinction between criteria
and predictors. All the other distinctions, as we have seen, either
disappear entirely upon examination, or exist only partially or
some of the time.

Where does all this lead us? It seems to me to lead to the
conclusion that there is no such thing as a "criterion" problem as
distinct from "predictor" problems. There are only measurement
problems, equally applicable to all measurement, whether predictors or
criteria. The measurement problems concern the best ways to collect [...]

[...] working formula -- there are too many
unknowns in the terms -- but it does have some use to us in helping us
order our thinking. For example, this formula tells us that two
people with different potential can arrive at the same state of
development at the same time because of differences in opportunity and
energy (lines A and C, Figure 1, converge between t5 and t6). Our
practical experience tells us that, indeed, this sort of convergence
does occur. Also, this orientation suggests that the best predictor
of some developmental point (a criterion) is the nearest practical
earlier point, measured fully. Otherwise, one must know much more than
one usually knows about opportunity and energy, since the longer
the time period separating the two points, the larger "t" becomes in
the equation, and the more important opportunity and energy become.
It has helped us a great deal in thinking about intellectual
development, and criteria are, as we see them, only points on the curve
of intellectual development.

We have said there is no specific criterion problem, only
measurement problems. Heaven knows these problems are severe enough.
As we look at them, they fall into several dimensions. Keep in mind
that all subject assessment, whether taken earlier as predictor
information or later as criterion data, can be collected in the same ways
and is affected by the same difficulties. There are no special
difficulties unique to either predictors or criteria.

Measurement data can be collected in many ways. Some of the
important ways are:

1. Ratings. We can ask the subject or someone else to give us an
opinion. On the relatively low level of measuring aptitudes, we have
been able a long time back to move from opinions to tested performance.
One reason for our success in that area has undoubtedly been our
ability to validate and refine aptitude test ideas against various
criteria. But we have not been so successful in this respect in our
development of criterion measurement ideas. Possibly one reason for
our lack of success here has been that we have not seen criteria for
what they are: points along a development continuum followed by other
points against which it should be possible to validate them. When we
look at criteria in this way, it seems to me that we don't have to
settle for the desperate position of Nagel (1953), Brogden and Taylor
(1950), and others that criterion measures by their nature are always
judgmental (i.e., not subject to verification). We can validate
criteria against later criteria and proceed with criterion development
in much the same way we have done with predictors. When we have brought
the state-of-the-art a little higher, we can perhaps dispense with
ratings as criterion data, just as we have done on the predictor side.

We shall see later that there is another, probably more important,
reason that we use ratings so often as criteria. At any rate, ratings
are now used much more often to collect criterion data than to collect
predictor data. There are a few things to recommend ratings: they are
quick and cheap and, under the right conditions, they can be made to
yield useful information about the ratee. On the negative side, some
problems inherent in the nature of ratings data loom very large.

There appear to be individual differences (as one should suspect)
in the ability of people to assess other people accurately. We are
doing work on this phenomenon, which Mr. Weeks will tell you about
later. Furthermore, even good raters are often put in a situation
which militates against the collection of good information. If the
ratee is to have access to the rating and if the rating is to influence
the ratee's career in any way, it is not likely that a supervisor will
produce ratings of his people which can be considered a good assessment
tool. The supervisor is placed in a position which requires him to
perform two mutually incompatible acts. As a supervisor, he is
responsible directly or indirectly for the morale and energy of his work
unit, which calls for support by him of his people; but he is also
required to render an objective and accurate appraisal which is likely
to damage some or all of those same subordinates. It is a rare
supervisor who can do both. As a result, all the operational rating
systems that I am aware of suffer the usual inflation of means and
compression of variance. I do not believe there is any way that a useful
criterion can be collected in a military environment from supervisor
ratings collected operationally in the usual way, so we have to look for
innovation. We are doing work which we think will alleviate this problem
somewhat, and I shall report more fully on this effort later.
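
To make "inflation of means and compression of variance" concrete, here
is a small sketch. It assumes a hypothetical 9-point rating scale and a
hypothetical leniency rule under which a supervisor never rates anyone
below 7; neither assumption comes from an actual rating system described
in this paper.

```python
from statistics import mean, pvariance


def lenient_rating(true_level, floor=7, top=9):
    """Hypothetical leniency rule: squeeze a 1-9 level into floor..top."""
    # Integer arithmetic keeps the mapping exact: levels 1-4 map to the
    # floor, 5-8 to the middle value, and 9 to the top of the scale.
    return floor + (true_level - 1) * (top - floor) // 8


true_levels = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # one ratee at each true level
operational = [lenient_rating(level) for level in true_levels]

summary = {
    "true_mean": mean(true_levels),
    "rated_mean": mean(operational),
    "true_variance": pvariance(true_levels),
    "rated_variance": pvariance(operational),
    "distinct_rated_values": len(set(operational)),
}

for name, value in summary.items():
    print(f"{name:22s} {value:.2f}")
```

In this toy case nine distinct performance levels collapse into three
rating values. The mean rises and the variance shrinks, so ratees who
differ in performance become hard to tell apart, which is what limits
such ratings as a validation criterion.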

2. Job samples. Job-sample tests, in their usual format,
are prohibitively expensive for operational use. I say this despite
the comments of several astute observers (e.g., Otis, 1953) who have
pointed out, in effect, that since good criterion information is
absolutely basic to all personnel actions, we should consider any
expense connected with its collection a very good investment. We
believe that a certain amount of actual job simulation, or assessment
center type evaluation, must be available for research purposes, but it
is probably impractical to consider this kind of criterion for anything
other than experimentation or in the evaluation of less expensive
methods. We are embarking on an effort to capture as much of the
essence of a job as possible on motion picture film, which can then be
used as a test stimulus for collecting criterion information in large
groups, thereby reducing its cost appreciably.

3. There are, of course, other ways to collect criterion
information (e.g., paper-and-pencil tests), all of which pose problems
which eventually we shall have to address. Some of the work we are doing
is on paper-and-pencil criterion tests, the items of which are selected
to maximize differences between subjects at different career levels. But
regardless of how the data are collected, there are other dimensions
of problems which must be considered also, so we must move on.
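
The paper does not say how items are chosen to "maximize differences
between subjects at different career levels." One common approach, shown
here purely as an assumed illustration, is to compare the proportion of
senior and junior subjects who answer each item correctly and to keep
the items with the largest gap. All names and responses below are
hypothetical.

```python
def proportion_correct(responses, item):
    """Share of subjects in a group who answered the item correctly."""
    return sum(r[item] for r in responses) / len(responses)


def select_items(junior, senior, items, keep):
    """Rank items by (senior - junior) proportion correct; keep the top."""
    gaps = {
        item: proportion_correct(senior, item)
        - proportion_correct(junior, item)
        for item in items
    }
    ranked = sorted(items, key=lambda item: gaps[item], reverse=True)
    return ranked[:keep], gaps


# Hypothetical responses: 1 = correct, 0 = incorrect.
junior = [
    {"q1": 1, "q2": 0, "q3": 0, "q4": 1},
    {"q1": 1, "q2": 0, "q3": 1, "q4": 0},
    {"q1": 1, "q2": 0, "q3": 0, "q4": 0},
    {"q1": 0, "q2": 1, "q3": 0, "q4": 1},
]
senior = [
    {"q1": 1, "q2": 1, "q3": 1, "q4": 1},
    {"q1": 1, "q2": 1, "q3": 1, "q4": 0},
    {"q1": 1, "q2": 1, "q3": 0, "q4": 1},
    {"q1": 1, "q2": 1, "q3": 1, "q4": 0},
]

selected, gaps = select_items(
    junior, senior, ["q1", "q2", "q3", "q4"], keep=2
)
print("selected items:", selected)
```

An item such as q4, which both groups answer equally often, contributes
nothing to separating career levels and is dropped, while an item such
as q1, which nearly everyone answers correctly, separates them only
weakly.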

Use of Data

Criterion data can be collected for many purposes: to promote, to
serve as a target variable for predictor tests, to indicate need for
training, to be used in reassignment of duties, and many more. When we
consider a particular set of criterion data, we should clarify as early
as possible what use is to be made of it, since the use may affect
decisions as to how, when, and from whom the data should be collected.
Most of our particular effort in AFHRL is directed toward development
of some reasonable target against which we may validate our predictor
tests. Historically, we have used technical school grades for this
purpose, but the Air Force is rapidly moving to self-paced training,
which poses very serious and rather obvious difficulties for
psychologists who are charged with the development of selection
procedures. Anyone concerned with the development of criterion
instruments must be concerned with problems in the use dimension. We have
all seen criterion ratings collected which were a hodge-podge of attempts
to evaluate a person's current status, his future potential, and his past
performance all rolled up willy-nilly into one exercise.

The use should be clarified and stipulated as early and as
thoroughly as possible, and decisions taken at that point. For example,
a criterion may be needed as a basis for rewarding past behavior. In
that case, criterion information obviously should be limited to past
behavior; ratings of potential are somewhat inappropriate. On the other
hand, management may want to know which of several candidates is most
likely to perform well in some new job which has opened up. In that
case, ratings of potential would be preferred (incidentally, notice that
ratings of potential are not really criteria in the traditional sense;
they are predictors of future performance, even though ratings are used
to collect the information). Or perhaps the reason for collection of
the data may be to decide whether or not to train particular employees.
If so, perhaps a comparison (not a conglomeration) of current
accomplishment and potential would be in order. The point is that a whole
constellation of problems revolves around the uses to be made of
criterion information, and that a great deal of thought should be given
to the projected use of the information and the time line of intellectual
development before the first step is taken to collect the data.

Level of complexity

Still another dimension of measurement problems is created by the
fact that intellectual development proceeds from more simple to more
complex.

1. The economics of rating attractiveness. It takes longer and
longer to observe all the necessary performance elements the further one
moves along the continuum of intellectual development, since learning
builds upon learning and current status consequently becomes more and
more complex. This is perhaps the primary reason why ratings have been
used and will continue for a long time to be used so prominently in the
collection of criterion information.

If one is measuring complex behavior with tests, he must be prepared
to require his subjects for longer and longer test sessions. One can
measure physical strength, reaction time, visual acuity, and other
simple characteristics in only 2 or 3 minutes each. It takes about
a half-hour to get a reasonable measure of verbal ability. It would
probably take at least 2 or 3 days of testing to get an adequate sample
of behavior which would indicate a subject's proficiency in, say,
aircraft engine repair. Indeed, we have seen reports describing some
proficiency tests that require up to 11 days to administer (McKnight &
Butler, 1964).

One assumes that a rater has already observed the complex behavior
of interest for several days, and, given the proper conditions, he can
report it with some objectivity. There is great appeal in an assessment
metric which can be collected with no cost of subject time and very
little of supervisor time. We have not yet been willing to pay the
price of obtaining more objective and more accurate test data, so we
sacrifice the greater objectivity of tests for the great convenience of
ratings. Furthermore, ratings can be collected on any level of complexity
desired, and I suspect that is why rating data collected in one situation
frequently predict rating data collected in another situation, despite
our certain knowledge that most sets of ratings contain many flagrant
errors. The very fact that ratings can be made of very complex
behaviors, compared with tests, means that we can reduce the distance
between D1 and D2 in our formula and thus reduce the very important
effects of potential and energy not well measurable at the present time.
We do not contend that this is as it should be, but it appears that this
is the way it is and will continue to be; so we believe a strong attack
on rating problems is of prime importance. Some of the rating problems
that come immediately to mind are:

a. How important are the old reliable problems, such as halo,
leniency, and the like?

b. What kinds of factors or characteristics make the best
rating medium? In what formats should they be cast?

c. Just as there are apparently individual differences in
rater accuracy, are there also reliable individual differences among
ratees which affect the accuracy of ratings made on them?

d. Assuming that we can measure individual rater accuracy,
what can be done in a situation using rated criterion data to improve
the psychometric qualities of ratings collected from a mixture of both
accurate and inaccurate raters?

e. We are convinced that if one intends to do research aimed
at a better understanding of criterion variables, he must be prepared
to do some social and organizational research as an integral part of
his effort. Such a simple problem as a slippage in the worker-supervisor
interface can cause very serious problems in performance evaluation.
If the supervisor sees the task as primarily A, B, and C, and the worker
sees it as primarily D, E, and F, the worker can be busy as an ant doing
the wrong things.

We are studying all these rating problems, and we appear to be
making a little progress.

2. Relevance. As one attempts to measure more and more complex
behaviors, relevance becomes more and more important. Several
investigators (Brogden & Taylor, 1950; Nagle, 1953) have pointed out the
necessity of attempting to include all important elements of the
criterion in the predictor set and to exclude from the predictor set any
elements not present in the criterion. That, of course, involves a much
more vigorous analysis of criterion variables than we are used to. But
I am sure you are all familiar with relevancy problems, and they don't
need to be restated here.

We see this set of problems as involving decisions about where and
how completely to sample behavior along the line of development. For
instance, it is likely that one who performs well on a test of mechanical
aptitude will do well as an automobile mechanic if other conditions lead
him to attempt the skill. A good automobile mechanic is likely to
become a good carburetor specialist, and so on. If we want to find
someone who will become a good carburetor specialist, do we measure his
mechanical aptitude, which we can do quickly and easily but which, by
its nature, is too simple factorially to capture much of the variance
we are interested in? Or do we measure his general automotive repair
knowledge, which is closer in time and in complexity to carburetor
specializing but which is far more difficult to measure?

Questions of this sort have no easy answers. Trade-offs and
compromise must be the order of the day until some breakthrough enables
us to measure complex behaviors much more satisfactorily than we do
now, or until we learn how to use measures of simple behavior in a
better, more comprehensive system.

One of the pitfalls we must be aware of is the seduction of a
criterion just because it is there. Indeed, if the criterion metric is
already there, just waiting for us to come use it, we should consider
it immediately suspect. It is undoubtedly relevant for someone's
purpose (or one assumes it wouldn't be collected), but it may have
little or no relevance for whatever measurement concept the investigator
has in mind.

To sum up, then, we believe that the little formula, D2 =
D1(1 + i)^n, and the line of intellectual development entailed by the
formula, has led us in some directions which we feel to be promising:

a. Because of the current difficulty of measuring complex
behavior, we believe ratings will be relied upon for a long while to
come. Because this appears true, we intend to concentrate a large
portion of our resources on studying rating variance and trying to
understand and correct for rating inaccuracies.

b. It would certainly help a great deal if we could plug in
some solid values for the potential, the opportunity, and the energy
which make up the term "i" in the equation, so that prediction of some
point on the development line could be made with a more complete set of
the simpler, more basic predictors. Some crude measures of all of
these terms are already available, but a great deal of research needs
doing, oriented around this point of view, to attempt to produce a
more usable system.

c. A great deal of research needs doing on ways to measure
complex behavior in an acceptable framework of subject time and overall
expense. Some of our most strongly held psychometric ideas may have to
be re-examined, particularly in our attempts to measure complex
behavior. For instance, one cannot demand high internal consistency of
items if he is attempting to construct a test which is deliberately
complex. Indeed, it may well be that some techniques should be applied
to item selection which simultaneously minimize internal consistency
and maximize validity, such as the Horst Fan Technique or something
similar.

Probably the formula is an oversimplification, but, whatever else
the formula may have done or not done, we are certain of one value it
has had for us. Though it may be illusory, it has at least contributed
a little to our peace of mind as we grope our way through this maze of
very complex problems.

FOOTNOTE

Extraneous remarks by Dr. Mullins 

1. To begin with, the paper that I'm going to give this morning is a
purely speculative paper. This particular one simply describes our
philosophy and ways that we have developed of looking at the
criterion problem. There is nothing empirical in it; it's, as I
say, just pure speculation. However, it does lead up to a point of
view which has helped us quite a bit, and we hope it will help you.

Performance ratings have been in the past and probably will
continue to be in the future the most common means of measuring job
performance. The reasons for this are that they can be quickly
obtained and are relatively inexpensive as compared to other techniques
of measurement. Despite the frequency of their occurrence, there are
many drawbacks to using performance ratings. Their typical low
reliability and validity are generally recognized. Indeed, the
measurement problems associated with ratings are so difficult that
some researchers have suggested that they not be used at all (Ronan &
Schwartz, 1971).

The basic problem with ratings lies in the fact that they actually
represent second-hand accounts of performance. With paper-and-pencil
devices, the subject records his performance on a piece of paper, a
vehicle which is not subject to change, distortions, misunderstandings,
poor memory, or gastrointestinal ailments. Such is not the case with
ratings. The subject's performance is recorded in a particular
situation, through a perceptual filter, on the memory of the rater and
then, on some later date, is transferred to paper.

Apart from the difficulties associated with the performance
evaluation process itself, rating research is often conflicting and
repetitious. Evidently the reason for this lies in the fact that there
is no generally accepted theoretical framework which serves as a guide
to research. The majority of rating literature is devoted to the
development of rating scales. Although the development of an objective,
error-free rating scale is highly desirable, ratings are influenced by
many variables, all of which deserve concerted research attention.

The rating paradigm, as we perceive it, consists of at least five
basic dimensions: (1) At the top of the list is the rater. We know,
for example, that his social adjustment, intelligence, similarity with
the ratee, and position relative to the ratee will have substantial
influence on ratings (Bruner & Tagiuri, 1954). There are probably many
other rater characteristics associated with rater accuracy, as well.
(2) The second dimension is the person rated. People differ in terms
of the degree to which they can be accurately evaluated. Allport (1937)
has indicated that some persons are more easily evaluated because they
have more "open" personalities. Others, because they are more "enigmatic,"
are less easily evaluated. (3) To these dimensions can be added the
traits or tasks to be rated. The value of judgments will vary
depending on whether or not the traits employed have observable
behavioral manifestations (Allport, 1937). Also, it has been found that
the accuracy of ratings will decrease as the complexity of the task rated
increases (Harris, 1966). (4) The social environment in which the
ratings are collected will also have an effect. Kipnis (1960) indicates
that leniency in ratings is more likely in a social environment described
as supportive than one described as stressful. (5) Finally, the
physical environment will influence ratings. Persons who are less
observable due to arrangements of the work space will be more difficult
to rate than those who perform in a situation that is more conducive
to observation.



The last and perhaps most important consideration, although not
strictly a rating dimension, is the purpose for which ratings are
collected. The value of ratings will differ depending on whether they
are collected for research purposes or for management decisions such
as promotions and salary increases. The inflation of means and
compression of variance typical of ratings collected for management
decisions frequently eliminate them as useful criteria for purposes of
test validation.

Obviously, the variables within each of these dimensions are quite
complex. Considerations as to the manner in which interactions among
these variables influence ratings boggle the mind. Our first research
effort focuses on one of these dimensions, the rater. Specifically, it
will be more concerned with the overall accuracy of judgments of
behavior rather than with separate factors associated with rater
inaccuracy such as central tendency, leniency, and halo. The goal of
our research is to maximize the quality of rating data used for
validation studies. If it were possible to identify the more accurate
raters and use only their judgments, we would be in a considerably better
position to determine the validity of our selection and classification
instruments.

Scientific interest in the accurate rater or the good judge of
personality occurred frequently in the 1930's and 1940's but eventually
gave way to the investigation of rating errors. In an excellent review
of the literature devoted to the ability to judge personality, Taft
(1955) indicates that the ability to judge is related to intelligence,
self-insight, emotional adjustment, and social skill. He further points
out that accurate judgments are based on possessing appropriate
judgmental norms, judging ability, and most importantly motivation.
Recently, Gordon (1970, 1972) performed some very interesting research
into the nature of rating accuracy. He suggests that rater inaccuracy
is due to two types of errors, either "falsely accusing the ratee of
doing something incorrectly which was in reality done correctly" or
"giving the ratee credit for something that was actually done incorrectly."
He provides evidence indicating a greater occurrence of the last type
of error, that is, "giving the ratee credit for incorrect behavior,"
and concludes that the accuracy of ratings depends on whether or not
the behavior observed is correct or incorrect.

The underlying assumptions for research into rating accuracy are
that persons differ with respect to their ability to accurately assess
performance and that there is consistency in their characteristic
rating responses. Indirect research evidence is available to support
these assumptions; Wiley (1959) and Wiley, Harber, and Giorgia (1959)
reported studies based on raters' estimations of the qualifications
necessary for various jobs. They concluded that rater differences do
exist in a consistent enough fashion to justify their measurement.

A final rather critical assumption, which we will investigate, is
that rater accuracy is a generalized ability. That is, we are assuming
that the accuracy of ratings will be maintained across traits or tasks
and ratees. Mullins and Force (1962) have gathered evidence which
supports this assumption. Using a sample of inexperienced raters, they
found that the capacity to evaluate verbal ability was directly related
to the ability to evaluate carefulness. However, the statistical
evidence obtained in support of this relationship was rather weak. In
opposition to the assumption that rater accuracy is a generalized
ability, Allport (1937) has indicated that "the ability to judge is
neither entirely specific nor entirely general, but that it is probably
more of an error to assume that it is entirely specific." Taft (1955)
agrees and goes further to indicate that the validity of the assumption
that rating accuracy is generalizable is dependent on a set of factors
which include the subject rated, the traits employed, and the
reliability of the criterion of accuracy. Since differences of opinion
do exist as to whether or not this is a justifiable assumption, it is
prudent to reserve judgment until further clarifying research has been
accomplished.

Obviously, the major problem with research into the nature of
rating accuracy is the establishment of a suitable criterion. That is,
a more ultimate measure of the trait judged must be obtained and employed
as a yardstick to determine the accuracy of the judgments made by
various raters. In some research, pooled judgments of the rated trait
have served as the basis for determining accuracy (Adams, 1927;
Ferguson, 1949; Greene, 1948; Wiley & Jenkins, 1964). However, as Taft
(1955) has pointed out, with this technique there is the possibility
that we are actually measuring the extent to which raters conform to
the group consensus or display the same biases as the criterion judges
rather than measuring rater accuracy. Other studies employed more
objective criteria to evaluate accuracy. Vernon (1933) used a
combination of independent ratings and test measures of the rated trait.
Norman (1953) and Gordon (1970, 1972) measured accuracy in terms of the
agreement between ratings and behavioral records. To circumvent the
difficulties associated with using pooled judgments as a criterion of
accuracy, we intend to use paper-and-pencil tests as a standard.

Our efforts will begin with a replication and extension of
research performed by Mullins and Force (1962). In this study,
differences between estimated and actual scores on a vocabulary test served
as the criterion of rater accuracy. That is, subjects estimated their
peers' scores on a vocabulary test after being informed of the average
and range of scores for the group. For each rater, the differences
between their estimates and the actual scores were averaged across
ratees and served as the basis for classifying the rater as either
accurate or inaccurate. It was hypothesized that if raters were
correctly identified, the correlations between ratings of a behavioral
trait (carefulness) and test measures of the trait would be greater for
the accurate than for the inaccurate raters. The results of the data
analysis supported this hypothesis.
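
The accuracy criterion just described can be sketched in a few lines of
Python. This is an illustration only, under assumptions that are not
taken from the study: the scores are invented, the differences are
averaged as absolute values, raters are split into accurate and
inaccurate at the median error, and the names rater_error and
classify_raters are hypothetical.

```python
from statistics import mean, median


def rater_error(estimates, actual):
    """Mean absolute difference between one rater's estimates and actual scores."""
    return mean(abs(estimates[ratee] - actual[ratee]) for ratee in estimates)


def classify_raters(all_estimates, actual):
    """Label each rater by comparing the rater's mean error with the median error.

    The median split is an assumption made for this sketch only.
    """
    errors = {rater: rater_error(est, actual) for rater, est in all_estimates.items()}
    cutoff = median(errors.values())
    labels = {
        rater: "accurate" if err <= cutoff else "inaccurate"
        for rater, err in errors.items()
    }
    return errors, labels


# Invented vocabulary-test scores for three ratees.
actual = {"A": 30, "B": 42, "C": 25}

# Invented estimates made by four raters for the same three ratees.
all_estimates = {
    "rater1": {"A": 31, "B": 40, "C": 26},
    "rater2": {"A": 40, "B": 30, "C": 35},
    "rater3": {"A": 28, "B": 45, "C": 25},
    "rater4": {"A": 20, "B": 50, "C": 33},
}

errors, labels = classify_raters(all_estimates, actual)
for rater in sorted(errors):
    print(rater, round(errors[rater], 2), labels[rater])
```

With labels of this kind in hand, the hypothesis in the text amounts to
comparing the rating-test correlation computed within each of the two
groups of raters.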

In the extension of this study, we will manipulate the criteria
used for identifying accurate raters. Differences between estimated
and actual scores on a test of verbal ability and on a test of a less
observable phenomenon, mathematics ability (and a combination of the
two) will be investigated as a basis of determining rater accuracy.
In addition, we will confirm our tentative identification of raters as
either accurate or inaccurate on the basis of multiple traits. Not
only will ratings and test measures of carefulness be compared as
before, but also we will compare ratings and test measures of
decisiveness, a trait less subject to observation than carefulness.

The last phase of the extension to the Mullins and Force study
will involve an attempt to predict rater accuracy. Using averaged
differences between estimated and actual scores on tests of verbal
and quantitative ability as the criterion, we will determine the
predictive efficiency of a set of variables hypothesized to be related
to rater accuracy. The predictors will include measures of
self-confidence, gregariousness, surgency, and compulsivity.

The potential payoff for this type of research is great. Further
down the road, we plan studies to determine if rater accuracy can be
increased by training. In addition, we plan to investigate the
possibility of statistically manipulating ratings in order to increase
their accuracy. Obviously, we have just opened the lid on this type of
research, and a lot of hard thinking must be accomplished to work out
the details and overcome the obstacles. Nevertheless, we have confidence
in this approach and feel that it will make a significant contribution
to the state-of-the-art.

Introduction




For many years, much of the research concerning the content of
evaluation instruments has focused on the relative merit of
behaviorally-based and trait-oriented rating scales for the evaluation of
job performance. One impetus for this research was the introduction by
Smith and Kendall (1963) of a technique for the development of behaviorally
anchored instruments. Basically, the procedure entailed having people
familiar with a particular job situation develop broad characteristics
or factors which cover all aspects of the job. Behavioral examples
are then developed to exemplify high and low performance points for
each characteristic as well as moderate performance points within the
two extremes. These behavioral examples are then written as
expectations of specific behaviors and re-evaluated by independent judges.
Only behavioral examples which are reliably judged as representing a
particular level of performance on the same characteristic are
included in the final evaluation form.

Since its introduction, the Smith and Kendall technique has been
applied and evaluated in a number of settings both in the field and the
laboratory. Its popularity is probably a result of the generally
accepted viewpoint that it is psychometrically better to evaluate
job performance using factors that are based on specific behaviors
rather than factors based on personality traits.

The primary problem faced by someone trying to develop relevant
performance factors for use in a large, complex organization is the
time and expense involved in using something like the Smith and Kendall
technique for the wide range of jobs encountered. The basic question
that needs to be answered is whether objective, job-specific factors
are psychometrically superior to more subjective personal-trait factors
in the evaluation of job performance. If the job-specific factors prove
to be statistically superior, then the practical significance of the
difference must be great enough to justify the cost involved in
developing the more objective factors.

Relevant Research

In a review of the literature on the content of evaluation
instruments, Kavanagh (1971) stated that the trend in this area of research
has been toward the use of objective and measurable traits as opposed
to personality traits in performance evaluation. He goes on to say
that despite the fact that the objective traits were gaining in
popularity, the empirical evidence in support of objective traits was
not strong enough to warrant their use in exclusion of personality
traits. Kavanagh further stated that the idea of an ultimate criterion
of job performance is a behavioral construct, and, therefore, construct
validation should be the method by which immediate measures of
performance are evaluated in terms of their relevance to the ultimate
criterion. He then categorized the relevant literature according to
the method of validation used in each study and reviewed them by
category.



One group of studies used inter-rater or re-rating reliability as
one method of validation. In general, the more objective traits proved
to be rated somewhat more reliably, but the results were certainly not
unequivocal, and many subjective personality traits also showed a high
degree of reliability. Kavanagh points out that validity by consensual
agreement is really a form of convergent validity and, according to
Campbell (1960), both convergent and discriminant validity are needed
for establishing construct validity.

Another group of studies reviewed by Kavanagh used validation
against another criterion to determine the relevance of rating scale
content. Kavanagh says that this approach is valid as long as the
criterion used for validation is closer to the ultimate criterion than
the ratings themselves. The problem is that this decision is usually
judgmental rather than empirical. (This touches upon the problem
mentioned in the paper by Dr. Mullins and Lt Col Ratliff with respect
to differentiating between predictor and criterion and the fact that
what we really have is a measurement problem.) In the group of studies
reviewed, the more objective traits generally showed a somewhat higher
validation against another criterion, but again the results were
inconclusive. Some studies showed personal traits to be better than
the more objective factors, and personal traits accounted for at least
some of the variance in most of the studies.

The third group of studies reviewed by Kavanagh used validation by 
the multitrait-multimetho'd matrix introduced by Campbell and Fiske 
(1959). The use of this scheme allows one to obtain measures of both 
convergent and discriminant validity so that overall construct validity 
of rating scales can be better inferred. The results of the studies
reviewed again proved to be equivocal with both objective and personal
traits being psychometrically superior in different situations.
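
As a rough illustration of how a multitrait-multimethod matrix yields
both kinds of evidence, the sketch below sorts the correlations among
trait-method measures into three groups and averages each. Everything
in it is assumed for the example: the scores are invented, the trait
and method labels (T1, T2, peer, supervisor) are hypothetical, and
averaging the correlations is a simplification rather than the
Campbell and Fiske procedure in full.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mtmm_summary(measures):
    """Average the correlations in each region of the matrix.

    measures maps (trait, method) -> list of scores, one per ratee.
    """
    groups = {
        "convergent": [],                # same trait, different methods
        "heterotrait_monomethod": [],    # different traits, same method
        "heterotrait_heteromethod": [],  # different traits, different methods
    }
    for ((t1, m1), s1), ((t2, m2), s2) in combinations(measures.items(), 2):
        r = pearson(s1, s2)
        if t1 == t2:
            groups["convergent"].append(r)
        elif m1 == m2:
            groups["heterotrait_monomethod"].append(r)
        else:
            groups["heterotrait_heteromethod"].append(r)
    return {name: sum(rs) / len(rs) for name, rs in groups.items()}


# Invented scores for five ratees on two traits, each rated by two methods.
measures = {
    ("T1", "peer"): [1, 2, 3, 4, 5],
    ("T1", "supervisor"): [2, 1, 3, 5, 4],
    ("T2", "peer"): [3, 1, 4, 2, 5],
    ("T2", "supervisor"): [4, 1, 5, 2, 3],
}

summary = mtmm_summary(measures)
for name, value in summary.items():
    print(name, round(value, 2))
```

Convergent validity is suggested when the same-trait correlations are
high; discriminant validity is suggested when they exceed the
correlations between different traits, whether measured by the same or
by different methods.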

In concluding his article, Kavanagh points out that based upon the
current literature, no absolute decision can be reached with respect to
the superiority of one type of rating factor over the other in all
situations. Kavanagh recognizes the basic problem of the relative
efficiency of objective traits versus the amount of time spent in their
development when he says, "objective job-oriented traits seem at present
to have a slight edge, but the problem of situational specificity and
additional time question the practical usefulness of this purist
approach" (p. 663).



Since the Kavanagh article, very few studies have been done which
specifically compare behaviorally-based and personality-oriented rating
factors. Campbell, Dunnette, Arvey, and Hellervik (1973) evaluated
behaviorally based factors which were developed for department store
managers using a modified form of the Smith-Kendall technique. They
found that when the factor scales were anchored with behavioral
expectations, the ratings showed less halo, leniency, and method
variance than when only broad definitions of the factors were used.
While personality trait factors per se were not used in this study,
it does show the decrease in the efficiency of behaviorally-based
scales when they are not anchored with behavioral expectation
statements. The authors also mention that "the managers who developed these
scales invested a tremendous amount of effort in the process" (p. 22).

Neither of the two major studies which specifically compared
behaviorally-based and personality-oriented factors found reason to
overwhelmingly support either type of rating scale. Burnaska and
Hollmann (1974) compared three rating scale formats using analysis of
variance techniques. They developed Smith-Kendall type behaviorally
anchored scales and scales with the same dimensions but without the
behavioral anchors just as Campbell et al. (1973) had done.
Additionally, Burnaska and Hollmann compared both of those formats with
scales made from a priori determined factors and no behavioral anchors.

Unlike Campbell et al. (1973), Burnaska and Hollmann found that
behavioral anchoring did not enhance the psychometric properties of
the systematically developed scales. While they did find that the
Smith-Kendall scales were somewhat less susceptible to leniency error
and allowed greater differentiation between ratees, they concluded that
"there is no evidence for the superiority of one format" (p. 311).
They based this conclusion on the fact that all three formats contained
composite halo and leniency error leading to small interratee
discrimination. This fact led Burnaska and Hollmann to question the ability
of even systematically developed scales to diminish raters' tendency to
rate according to an overall motivational component similar to Spearman's
"g" factor.

Borman and Dunnette (1975) studied essentially the same variables
that Burnaska and Hollmann had studied. The behavioral scales were
developed to evaluate the performance of naval officers, and the a priori
trait-oriented factors were those already in use on the Naval Officer
Fitness Report. They found that the behaviorally-based factors with
anchored scales were psychometrically superior to the other two rating
formats on measures of leniency, differentiation among ratees, halo,
and interrater agreement. However, the magnitude of the differences
was small, only sometimes reaching statistical significance. The
authors state that probably less than 5% of the variance in the
dependent variables can be accounted for by differences in the rating
formats. Noting the amount of time and effort required in developing
behaviorally-based factors, the authors question the usefulness of the
Smith-Kendall procedure if the scales are only going to be used for
performance ratings. They conclude that "at present little empirical
evidence exists supporting the incremental validity of performance
ratings made using behavioral scales" (p. 565).

The consensus of the literature to date is about the same as it
was at the time of the Kavanagh (1971) review. Behaviorally-oriented,
job-specific rating factors are generally shown to be somewhat
psychometrically superior to the more subjective personality trait
factors. However, even when the systematically developed scales are
shown to be more efficient, the differences between rating formats
are usually small. A real question still exists as to whether the
superiority of the job-specific factors, although statistically
significant, is of enough practical significance to warrant the time
and effort involved in their development.

Current Research

The Air Force Human Resources Laboratory has recently begun a series
of studies at the Air Training Command Noncommissioned Officers (NCO)
Academy. The purpose of these studies will be to analyze the content
issue in an Air Force environment. Of particular importance will be
determining the operational impact of various psychometric differences
in sets of rating factors. Hopefully, methodologies developed and
analyzed in this particular setting can later be used to develop
criterion instruments for a wide range of Air Force jobs.

The NCO Academy at Lackland AFB provides in-residence professional
military education for Air Force NCOs in the grades of E6 and E7. The
NCO Academy classes last for about 6 weeks. Typically, there are 135
students per class, and they are divided into 9 seminars with 15
students in each seminar.

The general strategy of the studies will be to have the students at
the NCO Academy render ratings on the other students in their seminar
group. Means, standard deviations, pooled variance, and other traditional
analyses will indicate the degree to which the rating factors are subject
to rater errors such as leniency and halo. Also, the instructors will
be asked to rate the students so that the convergent and discriminant
validity of the factors can be determined by use of the
multitrait-multirater matrix.
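
A minimal sketch of how simple descriptive statistics can flag these two
rater errors follows. The operational definitions are assumptions made
for the example, not definitions taken from the studies: leniency is
indexed by a rater's mean rating across all ratees and factors, and halo
by how little a rater's ratings vary across factors within each ratee.
The data and the name leniency_and_halo are hypothetical.

```python
from statistics import mean, pstdev


def leniency_and_halo(ratings):
    """Summarize each rater's ratings.

    ratings maps rater -> {ratee: [score on each factor]}.
    A high mean_rating suggests leniency; a within_ratee_spread near
    zero suggests halo (every factor rated alike for a given ratee).
    """
    summary = {}
    for rater, by_ratee in ratings.items():
        all_scores = [s for scores in by_ratee.values() for s in scores]
        spread = mean(pstdev(scores) for scores in by_ratee.values())
        summary[rater] = {
            "mean_rating": mean(all_scores),
            "within_ratee_spread": spread,
        }
    return summary


# Invented ratings on a 5-point scale: two raters, two ratees, three factors.
ratings = {
    "r1": {"s1": [4, 4, 4], "s2": [5, 5, 5]},
    "r2": {"s1": [2, 4, 3], "s2": [3, 5, 1]},
}

summary = leniency_and_halo(ratings)
for rater, stats in summary.items():
    print(rater, stats["mean_rating"], round(stats["within_ratee_spread"], 2))
```

In this invented case, rater r1 shows both a higher mean rating and no
variation across factors within a ratee, the pattern one would associate
with leniency and halo respectively.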

In addition to the traditional analyses done to determine the
psychometric properties of the factors, profiles will be made up on each
person based upon his or her average rating on each factor. These
profiles will be returned to the students, and they will be asked to
identify the people in their seminar groups from their profiles. They
will also be asked to rank order the profiles according to how well
they think a person with a particular profile will perform at the NCO
Academy. Analysis of these data will show the number of times each
person correctly identifies a classmate from his profile of scores.
Also, correlations will be generated to show the degree of association
between the rank ordering of the profiles and the actual rank ordering
of students at the end of the class. These additional analyses will
yield some measurement of the practical significance of differences
in psychometric properties of rating factors.
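
The association between two rank orderings can be illustrated with a
rank-order correlation. The text does not say which coefficient will be
used; the sketch below assumes Spearman's rho in its simple form, which
holds only when there are no tied ranks. The ranks are invented and the
name spearman_rho is hypothetical.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same people.

    rank_a and rank_b map person -> rank (1 = best). The formula
    1 - 6 * sum(d^2) / (n * (n^2 - 1)) is valid only without tied ranks.
    """
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same people")
    n = len(rank_a)
    if n < 2:
        raise ValueError("at least two people are needed")
    d_squared = sum((rank_a[p] - rank_b[p]) ** 2 for p in rank_a)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Invented example: predicted order from profiles vs. final class standing.
predicted = {"s1": 1, "s2": 2, "s3": 3, "s4": 4, "s5": 5}
final = {"s1": 2, "s2": 1, "s3": 3, "s4": 5, "s5": 4}

rho = spearman_rho(predicted, final)
print(round(rho, 2))
```

A coefficient of this kind could be computed for each set of rating
factors and the sets then compared on how well their profiles anticipate
final class standing.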

Thus far, two studies have been completed at the NCO Academy. The
first was a pilot study to determine and correct methodological problems
that would be encountered. The most significant result from the first
study was the identification of a set of 10 rating factors which the
students agreed upon as being appropriate for evaluating their
performance at the academy.

The second study has recently been completed, and the data are
currently being analyzed. Table 1 shows the results of some preliminary
analyses that were compiled from the data. While these results are in
rough form and need to be analyzed much more thoroughly, they do give
an example of the type of information that might be gained with our
experimental design.

In this particular study, three sets of 10 rating factors are being
compared. Two sets of factors come from a survey which was sent to Air
Force NCOs in the grades of E7, E8, and E9. These NCOs were asked what
factors they thought should be used to evaluate them on the job. The
top 10 factors and the bottom 10 factors chosen by survey respondents
make up two of the sets of factors used in this study. The third set
of factors is made up of those factors chosen by the students at the
academy as being appropriate for evaluating their performance. Each
set of 10 rating factors was assigned to 3 of the 9 seminar groups at
the academy. The students then used a rating form containing those 10
factors to rate the other members of their seminar group. They rated
each student on each factor using a 5-point scale labeled "Far Below
Average," "Below Average," "Average," "Above Average," and "Well Above
Average."

Using mean ratings across all factors as a measure of leniency error,
Table 1 shows that ratings using the student generated factors were
less susceptible to leniency error than either set of survey generated
factors. Of the survey generated factors, the bottom 10 factors were
superior to the top 10 factors. This same relationship appears when
considering the standard deviations of the factor scores, which is an
indication of the degree to which the ratings differentiate among
ratees. These are the types of analyses appearing in the literature
today, and sometimes differences as small as those shown in Table 1
are used to support the superiority of one type of rating factor over
another.



Table 1. Comparison of Three Sets of Rating Factors

                        Student       Survey       Survey
                        Generated     Top Ten      Bottom Ten

Means                     3.56          3.74         3.63
Standard Deviations        .42           .33         (illegible)
Hits                      3.42          2.17         2.68
Correlations               .43           .42          .39



The next step in this study was to develop a profile on each person
based upon his or her mean ratings on all factors. These profiles were
then returned to the students, and each student was asked to identify
the other students in the seminar group from their profile scores. In
Table 1, "hits" are used to designate the mean number of times people
were correctly identified using each of the three sets of factors. It
can be seen that students using the student generated factors averaged
identifying 3.42 out of 15 seminar members correctly, while those using
the survey bottom 10 factors identified 2.68, and those using the
survey top 10 factors identified only 2.17 correctly. This analysis
gives an indication that the relationships shown with the mean and
standard deviation scores have an influence on how well people can be
separated and identified in an operational sense.

If differentiation among ratees were the goal of the rating instrument,
then it appears that the student generated factors are superior
to the survey bottom 10 factors, which are in turn superior to the
survey top 10 factors. It also appears that the measurement of means
and/or standard deviations of the factor scores would give a reliable
indication of the relative superiority of the sets of factors without
going through the identification step.

However, simple identification and differentiation is rarely the
goal of a rating instrument. Instead, it is usually used to judge how
well a person performs his job. If a rating instrument did give an
accurate assessment of how well a job was performed, then differentiation
among ratees would certainly be achieved, assuming the ratees
performed the job at different levels of ability. However, even
though differentiation among ratees should result from using a valid
rating instrument, the fact that differentiation occurs is not
sufficient evidence for the instrument to be considered valid for
evaluating job performance. A good example is shown in the present
study.

The students were asked to rank order the profiles according to
how well they felt a person with a particular profile would perform
while at the NCO Academy. Table 1 shows the average correlations
between the rank ordering of the profiles and the actual rank ordering
of the students at the end of the class based upon their final grades.
It can be seen that the differences between correlations are insignificant
and that one set of factors seems to be just about as good as
another for actually predicting the performance of a ratee. Therefore,
while one set of factors is psychometrically superior to another set,
when judged against the criterion of actual job performance, the
superiority of any one set of factors disappears. This seems to point
out the importance of these additional analyses in trying to determine
the relative effectiveness of a set of factors in an operational
setting. While one factor or one set of factors may be psychometrically
superior to another, the practical significance of the differences should
be investigated before an operational decision is made.

Whenever ratings are collected from supervisors in an operational
setting, particularly if the ratee must be made aware of the rating
given to him, two undesirable consequences usually occur: the ratings
become "inflated" (that is, the mean approaches the upper range limit),
and the variance becomes compressed (that is, everybody gets essentially
the same score). The major reason these two effects occur is
that the supervisor is required to perform mutually incompatible
acts: he must support his people and he must critically evaluate his
people. It is very difficult to do both, so the reaction of most
supervisors, at least in large organizations, is to try to see that
their people get a better than average chance at promotion. As a
consequence, ratings creep up and accuracy falls off.

The effects just mentioned occur when operational ratings are
collected normatively. Normative scores are those which produce
norms, so that comparisons may be made across individuals in a group.
A ratee's score may be expressed as a percentile, showing his standing
in relation to other members of the group.

There is another kind of data which can be collected in a manner
that automatically minimizes the inflated means and the variance
compression customarily found when normative data are used for
operational ratings. Rating data can be collected in a manner (called
"ipsative" ratings) such that characteristics within an individual are
rated relative only to other characteristics of the same individual.
This method produces a profile of the characteristics, showing which
of the ratee's traits are his stronger ones and which are his weaker
ones. Nothing can be inferred about the strength of any of the ratee's
characteristics as compared with the strength of some other ratee on
that characteristic. If a list of characteristics is ranked for a
particular ratee from strongest to weakest, there is absolutely no
problem with mean inflation and variance compression because the mean
and the variance are fixed mechanically by the ranking process.

However, ipsative rankings (relative rankings of characteristics
within the individual ratee) are useless for operational evaluative
purposes unless they can be treated in some way so that the information
on each ratee can be compared with that for other ratees. For example,
it does little good to know that, say, creativity is Joe's strongest
characteristic and Mary's weakest characteristic if we are trying to
compare Mary with Joe. It is entirely possible that Joe is generally
so inept and Mary so generally expert that Mary's creativity,
although it is her weakest characteristic, may still be stronger than
Joe's creativity, although it is his strongest characteristic.

We can see two ways to convert ipsative rating data so that
comparisons can be made across individuals. One of these ways is by
computing an index of worker-job match. It is obtained simply enough
by correlating the ranking of characteristics describing the individual
with a similar ranking of the same characteristics as they are
required by the job, as shown in Figure 1. The ranking of job
characteristics should be performed by someone other than the one who
ranks these characteristics in the worker. The correlation coefficient
may be used in raw or converted form as an index of worker-job match.
It seems likely that if two workers are of the same level of general
competence averaged across separate applicable skills and traits, the
one whose pattern of characteristics most closely resembles the
pattern required by the job will be the one who performs better. The
worker-job match index can be included with whatever other variables
are available as candidates for criterion composites.

                              Rankings

                           Mary      Job X

Carefulness                  1          3
Responsiveness               2          1
Initiative                   3          4
Creativity                   4          5
Tolerance of stress          5          2
Cooperation                  6          9
Adaptability                 7          7
Writing ability              8         10
Speaking ability             9          8
Reasoning ability           10          6

                           Rho = .72

Figure 1. The computation of a worker-job match index.
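
The rho in Figure 1 can be checked with a short calculation. The sketch
below assumes the index is the ordinary Spearman rank correlation for
untied ranks; the function name is ours and is only illustrative.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two untied rankings of the same items."""
    if len(ranks_a) != len(ranks_b):
        raise ValueError("rankings must have the same length")
    n = len(ranks_a)
    # Sum of squared rank differences across the characteristics.
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Rankings from Figure 1, in the order the characteristics are listed.
mary = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
job_x = [3, 1, 4, 5, 2, 9, 7, 10, 8, 6]

match_index = spearman_rho(mary, job_x)
print(round(match_index, 2))  # 0.72
```

The squared rank differences sum to 46, so rho = 1 - (6 x 46)/990, which
rounds to the .72 shown in the figure.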

The worker-job match index yields information which should prove
useful. However, another treatment is possible, and we plan to
investigate it. A worker's pattern of characteristics could correlate
perfectly with the pattern required by a job, but he could be so
generally weak that he performs poorly; or he may possess such all-around
competence that he does well despite a poor job-match index.
All the job-match index reveals is the congruence of patterns of
characteristics between the worker and the job. It provides no information
at all on the relative strengths of two workers on any of the
characteristics. This is not a serious problem if the worker-job match
index can be included as just one component in a composite criterion
along with at least one pertinent normative variable. The normative
variable will establish a level of general competence, and the worker-job
match index will be weighted to the extent that pattern congruity
is important. But there are some situations in which tests are
disliked as a means of worker appraisal. In these situations, if only
one test can be administered, or if a score from a previously administered
test can be obtained from the files, or if any kind of reasonable
normative variable is available on a large number of workers, then a
situation can be set up so that an anchoring system can be employed.
The anchoring variable is common to the workers being evaluated and is
ranked along with the other characteristics. The other ranked
characteristics will fall above or below the anchor variable according
to how they are ranked for a particular worker. Standard scores
(percentiles, z-scores, or something similar) can then be assigned to
each of the ranked characteristics so that comparisons can be made
across individuals on each of the characteristics.

The conversion to standard scores required for this approach was
mentioned glibly in the previous paragraph, as if the problems
surrounding this important step were all solved. They have not been.
We believe we can produce a crude system of conversion now, but it will
need much sharpening. The production of standard scores such as these
involves some knowledge about intra-individual variability across
characteristics. We know that there is a fairly strong tendency for
positively regarded characteristics, both intellectual and
non-intellectual, to be intercorrelated (Horn, 1968). To the extent that
these characteristics are correlated, to that extent the intra-individual
variability will be reduced, and the more accurately standard scores
can be assigned to the ipsatively ranked characteristics. Our first
cut will be a very primitive conversion system based on distributions
of intra-individual variability obtained on other groups and other
characteristics (see Figure 2). The standard scores issuing from this
conversion system certainly will not be exact, but they should be
accurate enough to yield evaluations which, because of their relative
immunity to deliberate biasing by the supervisor, should prove more useful
than the system ordinarily used.

These standard scores will then be in a normative form, and they
become possible candidates, appropriately weighted, to form a composite
criterion score. The weights would be obtained by using the variables
as predictors of some more ultimate criterion, or of some criterion
which may be obtained experimentally but not operationally. It should
be obvious that the anchor variable system is not substantially different
from a system using the worker-job match index in conjunction with
at least one normative score on an appropriate variable. We plan to
compare both these systems.



                          Pete's
Characteristic            Ranking     Percentile

Carefulness                  1        85.5  [75 + (7/10 X 15)]
Responsiveness               2
Initiative                   3
Creativity                   4
Tolerance of stress          5
Cooperation                  6
Adaptability                 7
Reasoning ability            8        75    (measured anchor variable)
Writing ability              9
Speaking ability            10        72    [75 - (2/10 X 15)]

Range (from studying other characteristics, other populations)
= 15 percentile points.
Reasoning ability test score = 75th percentile.

Figure 2. Calculation of normative values for ipsative rankings,
using an anchor variable.
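
The arithmetic in Figure 2 can be sketched as follows. The figure gives
percentiles only for ranks 1, 8, and 10; applying the same linear step to
the remaining ranks is our assumption, and the function and variable
names are hypothetical.

```python
def anchored_percentile(rank, anchor_rank, anchor_percentile, n_ranked, spread):
    """Convert an ipsative rank to a percentile relative to a measured anchor.

    Each rank step away from the anchor moves the percentile by
    spread / n_ranked points (up for stronger ranks, down for weaker ones).
    """
    steps_above_anchor = anchor_rank - rank
    return anchor_percentile + steps_above_anchor * spread / n_ranked


# Pete's rankings from Figure 2; Reasoning ability (rank 8) is the anchor,
# measured at the 75th percentile, with an assumed range of 15 points.
pete = {
    "Carefulness": 1, "Responsiveness": 2, "Initiative": 3, "Creativity": 4,
    "Tolerance of stress": 5, "Cooperation": 6, "Adaptability": 7,
    "Reasoning ability": 8, "Writing ability": 9, "Speaking ability": 10,
}

percentiles = {
    name: anchored_percentile(rank, anchor_rank=8, anchor_percentile=75.0,
                              n_ranked=10, spread=15.0)
    for name, rank in pete.items()
}

print(percentiles["Carefulness"])       # 85.5
print(percentiles["Speaking ability"])  # 72.0
```

The two printed values reproduce the 85.5 and 72 entries in the figure;
the values for the other ranks are interpolations, not figures from the
study.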



Perhaps you will remember from the line of intellectual development
we discussed yesterday that it is our conviction that there is no
single criterion, immutable and all-encompassing. There are innumerable
points of intellectual development from birth to death, each a little
more complex than the previous one. It is conceivable that each of
these points may be eventually measurable, but each is so complex that
it is unlikely that any point ever will be completely measured for any
practical purpose other than research. A criterion is a measure, taken
at a desired point along the development line, of that portion of
intellectual development which seems to the investigator to represent
those functions with which he is most directly concerned. That point
may serve both as a criterion for predictors consisting of earlier
points and as a predictor for criteria taken at later points. With
this orientation, it is quite reasonable to "validate" criterion
measures against other criterion measures.

Because of the nature of this system, many studies will have to be
done before we can say with any confidence that the system is worth the
effort. The following questions, and many others, will have to be
answered:

1. Is the proposed system a better way of collecting evaluation
information than the simpler one of collecting normative rating data?
It appears that it should be better, but one cannot know for sure until
the system has been subjected to empirical scrutiny.

2. The efficiency of any evaluative scheme depends in large part
on the particular variables selected to enter the system. What is the
best way to select the variables needed? Captain Curton addressed
this problem in his presentation.


3. What weights should the various components of the system
take? For example, is the worker-job match index the most important
consideration, or the least important, or somewhere in between?

These short statements of research questions actually involve very
long and very difficult research work. We don't know how good the
system will prove to be, but we believe that it should at least be
better than the system of collecting rating data which is currently
used so widely.

Now and then a predictor battery is required in a situation where
no criterion exists. This kind of situation can arise when a new
specialty is born and there are no subjects currently performing in the
specialty; or when the specialty is so thinly manned or unusual that
requisite numbers of performers for validation studies simply do not
exist; or when management needs a predictor battery substantially
sooner than one can be produced by the classical validation technique.
Seven years ago, AFHRL developed two methods for furnishing a using
agency with a predictor battery immediately upon request, if the using
agency could provide a team of subject matter experts for about a
half-day effort (Mullins & Usdin, 1970). As part of the research work
connected with this effort, a comparison was made between the battery
furnished in the classical way and the batteries furnished with these
two synthetic methods, and it appeared that there was no practical
difference among the batteries in their efficiency in predicting an
empirical criterion. The two techniques are called the R-technique and
the M-technique, and both are based on the assumption that synthetic
criterion vectors can be devised which are similar enough to the
empirical criterion vector so that weights produced for the predictor
variables in the synthetic criterion situation will be essentially the
same as predictor weights generated in the classical empirical situation.
The focus of our previous research was almost entirely on the utility
of predictor weights produced synthetically, but we believe now that a
good estimate of the empirical validity coefficient can also be produced
synthetically. Both synthetic techniques make a few other important
assumptions:

1. It is assumed that decisions have already been made, or can be
made, about which predictor variables will enter the predictor battery.
This means that the variables are available off the shelf, or that the
preliminary work on the variables (concerning item analyses,
reliabilities, etc.) has already been accomplished. The predictors are
ready to go; all that remains is the problem of relative weights for the
separate predictor variables.

2. It is assumed that the requesting agency can furnish at least
three subject matter specialists who are thoroughly conversant with the
demands of the job to be performed, and that the producing agency can
furnish at least three test specialists who thoroughly understand the
tests in the predictor battery, or who can be made to understand them
by a brief statistical description of their characteristics.

3. If one is doing research on the techniques, it is assumed that
some empirical criterion will be available so that the weighted composite
scores generated synthetically can be compared for efficiency with
the weighted composite score produced empirically. If one is not doing
research, but simply producing a battery for a using agency, this
assumption is not absolutely necessary, but empirical demonstration of
the degree of efficiency of the synthetic composites is still desirable
if a criterion can be obtained. In the latter case, obviously, the
synthetically produced prediction composites can be considered as a
stop-gap measure until empirical weighting becomes a possibility.

R-Technique

The R-technique requires that the subject matter specialists and/or
the test experts (the judges) rate 100 subjects on how well the judges
believe, from studying the subjects' scores on the predictor variables,
the subjects will perform on the job of interest. The 100 subjects
need not be real people; they can be made up. If they are real people,
they should be selected from available subjects in such a way that
considerable spread is introduced into the profiles which are studied by
the judges. When the 100 subjects have been rated, the ratings are used
as a criterion against which all the predictors for these 100 subjects
are correlated. The multiple correlation, of course, produces a set of
weights for the predictor variables which are then used to calculate a
predictor composite for each of the subjects one is really interested
in.
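
The weighting step of the R-technique amounts to an ordinary multiple
regression of the judges' ratings on the predictor scores. The following
is a minimal sketch under that assumption, not the procedure from the
Mullins and Usdin report; the six dummy profiles, the two predictors,
and the stand-in "judge" rule are all invented for illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_technique_weights(profiles, ratings):
    """Least-squares weights (intercept first) predicting ratings from profiles."""
    rows = [[1.0] + list(p) for p in profiles]
    k = len(rows[0])
    # Normal equations: (X'X) w = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ratings)) for i in range(k)]
    return solve_linear(xtx, xty)


def composite(weights, scores):
    """Predictor composite for one subject of interest."""
    return weights[0] + sum(w * s for w, s in zip(weights[1:], scores))


# Hypothetical dummy subjects (two predictor scores each) and a stand-in
# for the judges' ratings; real use would have about 100 rated profiles.
demo_profiles = [(2, 4), (4, 2), (6, 8), (8, 6), (5, 5), (3, 9)]
demo_ratings = [1 + 0.5 * a + 0.25 * b for a, b in demo_profiles]

weights = r_technique_weights(demo_profiles, demo_ratings)
print([round(w, 3) for w in weights])
print(round(composite(weights, (7, 3)), 3))
```

Because the made-up ratings follow an exact linear rule, the regression
recovers that rule; with real judges the weights would capture only the
linear part of their rating policy.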

M-Technique

The M-technique is also a way of arriving at relative weights for
the various predictors, so that a prediction composite can be calculated
for the subjects of interest. The judges also provide the information
for this technique, but the information is of rather a different kind.
Instead of estimates of likely performance of a sample of dummy subjects,
the M-technique produces estimates of relative importance of variables
comprising the predictor set. The predictor variables are
factor-analyzed, the resultant factors are explained to the judges, and the
judges are told to distribute 100 points among the factors according to
how important the judges believe the factors are in producing good job
performance.

If a real criterion were available, it could be introduced into
the factor analysis and its correlations with any predictor would be
reproducible by multiplying the criterion's factor loadings by the
corresponding factor loadings of the predictor and then summing these
products across all factors. In this way, a validity vector can be
produced from a table of factor loadings. But our problem involves a
situation where no criterion exists.

Since no criterion exists, and consequently no criterion factor
loadings exist, the square roots of the distributions of 100 points
among the factors by the judges must substitute for the loadings. Then,
by the arithmetic described above, an estimated validity vector is
produced and, from this, weights for the various predictors are obtained.
The details of both techniques for producing weights are contained in
the Mullins and Usdin report.
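
The validity-vector arithmetic described above can be sketched briefly.
We assume here that the judges' points are converted to proportions of
the total before the square root is taken, so that the substitute
loadings behave like factor loadings; the text does not spell this out,
and the loadings and point allocations below are invented.

```python
import math


def estimated_validities(predictor_loadings, judge_points):
    """Estimate predictor validities from factor loadings and judges' points.

    predictor_loadings: one row per predictor, one loading per factor.
    judge_points: points the judges distributed among the same factors.
    """
    total = sum(judge_points)
    # Assumed convention: square root of each factor's share of the points
    # stands in for the missing criterion loading on that factor.
    pseudo_loadings = [math.sqrt(p / total) for p in judge_points]
    return [
        sum(load * crit for load, crit in zip(row, pseudo_loadings))
        for row in predictor_loadings
    ]


# Hypothetical example: three predictors, two factors, 100 points split 64/36.
demo_loadings = [
    [0.8, 0.1],
    [0.3, 0.6],
    [0.0, 0.5],
]
demo_points = [64, 36]

validity_vector = estimated_validities(demo_loadings, demo_points)
print([round(v, 2) for v in validity_vector])  # [0.7, 0.6, 0.3]
```

The step from this estimated validity vector to predictor weights is not
shown; the source refers the reader to the Mullins and Usdin report for
it.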

In the previous work done on these techniques, a criterion of
technical school grades was available for 1,000 subjects from each of
four schools, one in each of the Air Force's four aptitude areas
(mechanical, administrative, general, and electronic). An empirical
composite was computed in the usual way. Each of the four samples was
randomly split into two 500-man subsamples. One of these subsamples
was used to generate weights, and the other was used to cross-validate.
The cross-validated R was used as a reference point, and, within each
of the four cross-validation subsamples, other prediction composites
were computed for each subject, generated by the synthetic approaches.
In most instances, the synthetically generated composites produced
validities which, for practical purposes, were not different from those
produced in the usual empirical way. In only one school was the
prediction of the empirical criterion significantly worse using the
synthetically generated composites, and that difference was barely
significant at the .01 level.

At the present time, two further investigations of these techniques
are under way. One of these investigations is analogous to the previous
study in that technical grades are once again the criterion of the
prediction battery. The other on-going investigation expands the
application of the techniques to the prediction of ratings of on-the-job
performance.

If the replication work currently under way produces results as
encouraging as the previous study, this approach to validation of our
Air Force predictor tests will form at least an interim position while
the search for a satisfactory criterion continues.

Introduction

The title of my paper is "What Is the Value of Aptitude Tests?"
No one could feel comfortable dealing with such a broad and controversial
topic, especially in front of a group of professionals in the
testing business, but I feel the topic needs to be discussed and
debated.

Recently, some individuals have gone so far as to suggest that
testing be done away with altogether. Good heavens! Haven't we
demonstrated for decades the value of tests in personnel selection and
classification? Of course we must deal with reasonable questions
concerning the fairness and job relevance of tests, but surely all
military managers should see that tests are indispensable.

Evidently, we have done an inadequate job in merchandising our
product. For this reason, I would like to look at the manner in which
we have attempted to sell the value of tests and see if there are holes
in our case. Then, I will venture to make a few suggestions for
reorientation of our sales pitch and research strategies.

Present Defense

As I review the situation, I find that we have defended the value
of aptitude tests on three grounds: (1) their ability to predict
performance on the job; (2) their ability to predict attrition in
training; and (3) their ability to predict course grades. I would
like to consider these one at a time.

Prediction of Job Performance

First, let's consider job performance. Now let's be honest about
it. We really don't have overpowering evidence that our tests predict
job performance, and informed managers and operators know that we don't.
Many of these individuals are of the opinion that the key to productivity
is not individual differences in aptitude, but good management.
Experience teaches them that nearly all personnel they deal with on a
day-by-day basis could get the job done if they simply applied themselves.
The individual differences they observe are mostly motivational,
or else are not job related.

Of course, these managers are right. What they fail to understand
is that this lack of variance is, to a large extent, the product of
testing and training. If managers in an electronics maintenance
occupation were to receive a random sample of untrained personnel out
of the general population and attempt to generate the required skills
on the job, I can assure you that they would quickly become acutely
aware of individual differences in aptitude. However, this would not
be an efficient way to run a military service. We use tests to select
and classify individuals into occupations such that each person has the
capacity to acquire the necessary skills for acceptable job performance.
The training program, in turn, is geared to provide each trainee with
these required skills. If the process is efficient, then, there is no
reason why tests should predict performance variance on the job, and
we should neither make apologies nor hang our heads in shame when such
is found to be the case.

Prediction of Attrition 

The second way we have defended our tests is by showing how well
they predict attrition in training. In the Air Force, a washout in
pilot training costs the service thousands of dollars, and the claim is
made that millions of dollars of additional costs are avoided each year
by using tests to screen out applicants likely to fail in training. On
the surface, this sounds like a strong case for tests. It can be shown
that within any training class, individuals with high aptitude scores
wash out at a much lower rate than individuals with low scores. It is
also true that washouts are very expensive. However, it is not easy to
demonstrate that our aptitude tests save money by reducing washout rates.

Let me show you some data extracted from the Army Air Forces
Aviation Psychology Research Report No. 2 (DuBois, 1947).

Table 1. Attrition Rates and Aptitude Input for Every
Third Pilot Training Class (44C thru 45G)*

                       Aptitude     Percent
Class          N       Cutoff       Eliminees

44C          12,232       3           15.5
44F           9,371       3           12.0
44I           6,466       4           19.6
45A           6,525       4           21.0
45D           1,384       4           21.5
45G             664       6           27.4

*Extracted from Report No. 2, "The Classification Program," Army Air
Forces Aviation Psychology Program Research Reports, 1947.

Table 1 reflects pass/fail data for every third class from 44C through
45G. In classes 44C and 44F, the cutting score on the aptitude score for
entry was Stanine 3, and the average attrition rate was 13.9%. In
classes 44I, 45A, and 45D, the cutoff was raised to Stanine 4. However,
instead of going down, the attrition rate increased to 20.4%. Finally,
in class 45G, the cutting point was raised to Stanine 6, yet the attrition
rate went up again, clear up to 27.4%. In view of these data, one
might conclude that attrition in pilot training would be minimized if
those cases having the least aptitude were entered into training.

Of course, this is not true. The fact is that attrition rates were
controlled by administrative actions, and were not dependent on the
quality of the input. The number of pilot graduates was determined in
large part by the number of cockpits to be filled. The data shown in
Table 1 reflect actions taken toward the end of the war as the number
of trained pilots became abundant and aircraft production was reduced.
We have good reason for believing that the quality of graduates from
these classes varied, but we cannot demonstrate that the use of tests
saved money by reducing attrition rates.

We would have an even more difficult time demonstrating the influence
of tests on attrition rates in enlisted courses. The number of graduates
from such courses is ordinarily programmed months in advance to meet
operational requirements, and fluctuations in input talent produce only
minor fluctuations in attrition rates. During periods of low quality
input, it is not uncommon to increase wash-backs and remedial training
to maintain production standards.

Pass/fail is a very slippery criterion, and attrition rates seem
to be arbitrarily established. This phenomenon is not restricted to
the military. For example, there are wide variations in the input
talent to colleges and universities, where attrition rates for the
same courses are essentially equivalent. A washout from MIT or Cal Tech
could be an honor graduate from certain other colleges and universities.
We seem to be living in a relative world without absolute standards.
This is one of the problems we face in demonstrating the value of tests.



In 1957, Dr. Krumboltz and I published a study (Krumboltz & Christal,
1957) in which we demonstrated that the probability of a student
completing pilot training is a function of the aptitude levels of the other
three students with whom he is grouped under the same instructor. A
student with a Stanine 5 was less likely to graduate if he were grouped
with three students at the Stanine 9 level than if he were grouped with
three students at the Stanine 5 level.

In 1959, an investigator in Australia reported a strange and
related finding (Want, 1959). In that country, Air Force and Navy
pilots were being trained together under the same instructors. The Air
Force raised their entrance requirements, and the result was that the
attrition rate for Navy trainees nearly doubled. While the level of
talent of Navy trainees remained constant, these individuals began
looking bad in comparison with their Air Force counterparts.

These studies demonstrate that aptitude tests do measure differences
in abilities which are recognized by instructors. However, we
will not be able to defend our tests on the basis of their role in
reducing attrition rates until absolute standards for successful course
completion are implemented and adhered to.

Prediction of Course Grades

A third way we have attempted to show the value of tests is in
terms of their ability to predict final course grades. The statement
that aptitude tests predict course grades is irrefutable. Literally
hundreds of studies have consistently demonstrated this to be so. To
prove that we haven't lost our grip in this respect, I've brought along
results from one of the largest Air Force validation studies ever
conducted, which I will display to you.

We began with a 380,000-case population graduating from Air Force
entry-level courses between January 1969 and April 1974. From this
population, we randomly selected 1,000 cases from each course when
available, or a total sample when data were available from fewer than
1,000 cases. This yielded a total validation sample of slightly more
than 100,000 cases, representing graduates from 134 different courses.
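The sampling rule just described (1,000 random cases per course, or every case when fewer are on file) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the seed, and the toy course data are hypothetical and are not taken from the study.

```python
import random


def draw_validation_sample(cases_by_course, cap=1000, seed=0):
    """Keep `cap` randomly chosen cases per course, or all cases if fewer exist."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = {}
    for course, cases in cases_by_course.items():
        if len(cases) > cap:
            # Random draw without replacement from the course's graduates.
            sample[course] = rng.sample(cases, cap)
        else:
            # Fewer than `cap` cases on file: use the total sample.
            sample[course] = list(cases)
    return sample


# Hypothetical population: one large course and one small course.
population = {
    "course_a": list(range(1500)),
    "course_b": list(range(300)),
}
sample = draw_validation_sample(population)
sizes = {course: len(cases) for course, cases in sample.items()}
```

Capping each course keeps the large courses from dominating the pooled sample while still using every available case from the small ones.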

Table 2. Validities (R) of AQE/ASVAB/AFQT for Course Grades*
for AFSs with AQE/ASVAB Cutoff at 80th Centile

The validity coefficients I will show are uncorrected multiple
correlation coefficients for a weighted composite of the four AQE
composites and AFQT against final course grades. The values in Table 2
show the validities computed in 42 courses for which the cutting score
on AQE was at the 80th centile. These coefficients may look a little
low, but remember that they are uncorrected and have been computed in a
sample which has been subjected to severe restriction in range on the
predictors. Since the bivariate normality assumptions could not be
met, no corrections for restriction were made. However, it is estimated
that in an unrestricted population, many of these validities would be
found to be in the .60s, .70s, and .80s. The median correlation
obtained in the computing sample was .42. The lowest reported validity
is for a Linguistic/Interrogator course for which the Air Force has
special additional screening procedures.
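To give a feel for how much restriction in range can depress a validity coefficient, the sketch below applies the standard single-predictor correction for direct range restriction (often called Thorndike's Case II). It is not the procedure used in this study, where no corrections were made, and it treats one predictor rather than a weighted composite. The ratio of unrestricted to restricted standard deviations used in the example is a hypothetical value chosen only for illustration.

```python
import math


def correct_for_range_restriction(r_restricted, sd_ratio):
    """Single-predictor correction for direct range restriction.

    r_restricted: correlation observed in the selected (restricted) group.
    sd_ratio: unrestricted predictor SD divided by restricted predictor SD.
    """
    r = r_restricted
    u = sd_ratio
    return (r * u) / math.sqrt(1.0 - r * r + r * r * u * u)


# Hypothetical example: an observed validity of .42 in a group whose
# predictor standard deviation is half that of the full population.
corrected = correct_for_range_restriction(0.42, 2.0)
```

With these assumed inputs the corrected value is about .68, which is in line with the statement that many validities would fall in the .60s to .80s in an unrestricted population. The correction assumes linear regression and equal error variance across the score range, so it should be read as a rough indication only.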

Table 3. Validities (R) of AQE/ASVAB/AFQT for Course Grades*
for AFSs with AQE/ASVAB Cutoff of 60th or 70th Centile

      R        N
    .647       78
    .631      139
    .524      658
    .619      163
    .586      434
    .551     1000
    .535      605
    .531     1000
    .529      606
    .527      908
    .527      333
    .518     1000
    .518     1000
    .517     1000
    .502      892
    .498      612
    .492     1000
    .491       65
    .484      539
    .474     1000
    .458      291
    .440     1000

Remaining rows (course titles only): Comp Operator, Comp Programmer,
Small Arms, AC&W Operator, Radio Operator

Median R = .440          Total N = 28,707

*For cases graduating between Jan 1969 and Apr 1974.



Table 3 reports uncorrected validities for 36 courses having entry-
level requirements at the 60th or 70th centile on AQE. Again, these
coefficients are attenuated by severe restrictions in range, although
some of the uncorrected Rs are higher than .50.

I might point out that five of the lowest six coefficients in this
table are associated with courses training students in operator-type
jobs. Two are for radio and Morse system operators, for which a special
code test is available to enhance prediction of student success. The
other three are for computer operators, aircraft control and warning
operators, and small arms specialists. In each instance, certain
perceptual-psychomotor skills are required which are not measured by
the AQE or AFQT.

The median uncorrected validity of the tests for these 36 schools
was .44 which, again, is a gross underestimate of values which would
have been obtained in an unrestricted sample.

Table 4. Validities (R) of AQE/ASVAB/AFQT for Course Grades*
for AFSs with AQE/ASVAB Cutoff of 40th or 50th Centile

    .552

Total N = 42,973

*For cases graduating between Jan 1969 and Apr 1974.



Table 4 reports validities for grades in 56 courses for which AQE
entrance requirements are at the 40th or 50th centile levels. These
coefficients are higher because they are less subject to restriction in
range. The median value is .53. However, these coefficients are all
considerably below what would be obtained in an unrestricted sample.
Not only have the lower 40 to 50 percent of the standardization
population been denied entry into the course, but the number of cases in
the upper levels of the aptitude distribution is severely limited, due to
siphoning off by more demanding courses.

Once again, by the data I have presented, we can demonstrate that
aptitude scores predict course grades. I'm not sure, however, that this
fact impresses the average military manager. After all, one cannot
translate course grade points into dollars and cents or manpower bodies.
Nor have we been able to demonstrate convincingly that graduates with
high course grades actually perform better on the job than graduates
with low course grades, even though they in fact may do so.

Summary of Current Status

So here we stand. Although we feel that aptitude tests predict job
performance, we have very little data to support this contention. We
would like to claim that the use of tests reduces attrition in training,
but the evidence suggests that attrition rates are primarily a function
of administrative actions, not level of input talent. We can show that
test scores predict course grades, but this doesn't seem to impress the
average military manager. Where do we go from here?



Suggested Criteria for Test Evaluation

It would be my recommendation that, in the future, we focus our
attention on five types of criteria for test evaluation as follows:

1. Speed of skill acquisition   }
2. Speed of skill decay         }  Skills Maintenance
3. Speed of skill reacquisition }

4. Speed of response            }  Performance
5. Accuracy of response         }

Speed and accuracy of response may be important in some occupations
involving a demand for perceptual-psychomotor or clerical skills.
However, due to time limitations, I have elected to address only the
first three criteria, which relate to the speed of skill acquisition,
decay, and reacquisition. In all three instances, the basic variable
against which tests are to be evaluated is TIME. Time is an excellent
criterion. It has a zero point; it can be measured in equal intervals;
it is easily understood by military managers; it can be easily
converted into dollars and cents or manpower spaces; and it is the
single most expensive item in the military budget.

The military services spend literally billions of dollars each
year supporting the development and maintenance of skills. The more
obvious expenditures are associated with formal residence and on-the-job
training courses, but this is just the tip of the iceberg. For example,
the Air Force spends hundreds of millions of dollars each year just to
maintain pilot and navigator skills. Even more costly is the time
individuals in all services spend in learning to perform new tasks as
they are encountered on a day-by-day and assignment-by-assignment
basis. To the extent that aptitude scores predict the time required
for individuals to acquire and maintain skills, they can be used to
reduce costs and optimally distribute talent to jobs. I will address
this issue during my remaining time.

Skills Acquisition

There is nothing unique or new about the concept of aptitude scores
predicting learning rate. For example, in 1963, John B. Carroll
recommended that aptitude be defined as learning rate (Carroll, 1963).
The first intelligence test developed by Alfred Binet, back in 1904,
was designed to measure differences in the level of skills acquired by
individuals during a constant time interval (chronological age). These
scores were later normed and converted into a score called "mental age."
A ratio of the mental age to chronological age was computed and came to
be called the Intelligence Quotient (IQ). Regardless of the problems
associated with the development and utilization of IQ scores, they have
been used for years as rough indicators of individual learning rates.

In the academic world, many tests are called learning abilities
measures, and have been used for decades by teachers to place pupils
into homogeneous groups so as to minimize variance in learning rates
within groups. Tests have been shown to be valid predictors of school
grades, both in the academic world of the civilian sector and in all
military services, and school grades can be viewed as the amount of
content mastered by students when learning time is held constant.

Aptitude tests also predict proficiency test scores in the services,
which are rough measures of the amount of content mastered by
individuals at various career points. In Project UTILITY (Vineberg &
Taylor, 1972), which was conducted for the U.S. Army by the Human
Resources Research Organization in the late 1960's, AFQT scores were
shown to be related to the rate of skill acquisition in several
occupational areas. However, with the passage of time, an increasing
proportion of men at all levels of AFQT appeared in the upper ranges of
performance distributions, indicating that for these low-level
occupations aptitude scores predict the rate of skills acquisition, but
not ultimate level of performance. Pilot training programs are generally
locked-step. For this reason I have been unable to locate data
demonstrating that aptitude scores predict speed of skill acquisition.
However, pilot aptitude tests do predict within-class elimination for
flying deficiency, and individuals in the flying research area assure me
that slowness in acquiring skills is the primary cause for such
elimination. This observation needs to be confirmed by carefully
controlled research.

While the evidence that aptitude scores predict learning time is
substantial, most of it is indirect. Outside of a few laboratory
experiments dealing with paired associates learning, I have been able
to locate few studies directly addressing the subject, and these have
involved small N's and produced mixed results. In one study conducted
by a graduate student at the University of Pittsburgh (Wang, 1968)
and in another study conducted by the Human Resources Research
Organization (Wagner, Behringer, & Pattie, 1973), substantial
relationships were found between general and specialized aptitude tests
and learning times; however, there appeared to be complex interactions
among learning rates, types of materials to be learned, training
modalities, and various aptitude scores. If such findings are generally
confirmed, the proper selection and classification of personnel may be
more complicated than it appears on the surface. However, in one
unpublished study conducted by the Navy,* no such interactions were
found, and standard Navy aptitude tests were demonstrated to have
substantial validity for predicting training times (see Table 5).
This study involved two tracks in a Navy aviation familiarization
course, one which was made up solely of reading modules, and the second
which included seven slide/tape modules. Interestingly, the higher
validities were obtained for the slide/tape group. Notice that the
equations predicting time criteria for the two treatments were highly
homogeneous.
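The footnote to Table 5 states that the multiple R's and cross-application R's were computed from correlation matrices. A minimal sketch of that kind of computation is given below, assuming standardized variables. All correlation values in the example are hypothetical and are not the Navy data; the function names are likewise illustrative.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def regression_weights(rxx, rxy):
    """Standardized regression weights: solve Rxx * beta = rxy."""
    return solve(rxx, rxy)


def composite_validity(weights, rxx, rxy):
    """Correlation between a weighted predictor composite and the criterion."""
    k = len(weights)
    numerator = sum(weights[i] * rxy[i] for i in range(k))
    variance = sum(weights[i] * rxx[i][j] * weights[j]
                   for i in range(k) for j in range(k))
    return numerator / math.sqrt(variance)


# Hypothetical predictor intercorrelations and validities for two groups.
# Validities are negative because higher aptitude goes with less time.
rxx_1 = [[1.0, 0.6, 0.4], [0.6, 1.0, 0.3], [0.4, 0.3, 1.0]]
rxy_1 = [-0.50, -0.40, -0.30]
rxx_2 = [[1.0, 0.55, 0.35], [0.55, 1.0, 0.35], [0.35, 0.35, 1.0]]
rxy_2 = [-0.45, -0.35, -0.25]

w1 = regression_weights(rxx_1, rxy_1)
w2 = regression_weights(rxx_2, rxy_2)
multiple_r_1 = composite_validity(w1, rxx_1, rxy_1)   # development sample R
multiple_r_2 = composite_validity(w2, rxx_2, rxy_2)
cross_r_1_on_2 = composite_validity(w1, rxx_2, rxy_2)  # group 1 weights in group 2
cross_r_2_on_1 = composite_validity(w2, rxx_1, rxy_1)  # group 2 weights in group 1
```

When the cross-application R's come out close to the development sample R's, as the text reports for the two treatments, the two prediction equations are doing nearly the same job. A cross-applied R can never exceed the multiple R computed in the sample it is applied to.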

Table 5. Validities of Aptitude Scores for Time (Hrs)
Criteria in Navy Aviation Familiarization Course

Group #1 - 7 Slide/Tape + 9 Reading Modules (N = 109)

    Aptitude Test      Validity
    GCT                  -.58
    Arithmetic           -.47
    Clerical             -.34
    Multiple R            .67

Group #2 - 16 Reading Modules (N = 113)

    Aptitude Test      Validity
    GCT                  -.45
    Arithmetic        (illegible)
    Clerical             -.26
    Multiple R            .51

Multiple R's and Cross-Application R's

    Development Sample R:         .67    (second value illegible)
    Cross-Application Sample R:   .66    .53

*Information in this table was provided by Dr. Kirk A. Johnson, Navy
Personnel Research and Development Center, Memphis Branch Office,
Millington, Tennessee. Multiple R's and cross-application R's were
computed by the author using the correlation matrices provided by
Dr. Johnson.

I was also able to obtain data for a 200-case sample of Air Force
personnel who recently completed an individualized instruction course
(Inventory Management) at Lowry Air Force Base. Two criteria were
available, one of which was a summation of time to complete the course
blocks, and the other of which was a summation of course block scores
(grades). The results of this analysis are presented in Table 6.
The multiple validity of two ASVAB composites and AFQT for the training
time criterion was only .39, which was significant, but lower than
hoped for. However, the multiple validity of three ASVAB composites
for the sum of block test grades was .59, which is higher than was
obtained for final school grades when the course was taught in a
locked-step fashion. Even though this course is now taught in an
individualized instruction mode, there appears to be more predictable
variance in the amount of content mastered than in the time for course
completion. This finding is explained, in part, by the fact that
students in the course took module and block tests when they felt they
were ready for examination. Upon first testing, some students barely
reached a 70% passing standard, while others routinely scored 100% on
many tests. These latter students had reached the 70% standard at much
earlier (but unknown) points in time, so there was no simple way to
compute a time-to-standard for each case. In this sample, the
correlation between the time and grade criteria was -.40, indicating that
students completing the course in the shortest time tended to be those
who mastered the greatest amount of content.

Table 6. Validities of ASVAB/AFQT Scores for Time and Grade
Criteria in the Air Force Inventory Management Course
(N = 200)

Development Sample R's

    Criterion   Predictors                                 Multiple R
    Time        General AI, Electronic AI, AFQT               .39
    Grade       General AI, Electronic AI, Mechanical AI      .59

Cross-Application R's

    Source of Predictive Weights   Application Criterion     R
    Time Criterion Predictors      Grade                    .37
    Grade Criterion Predictors     Time                     .55

There is not time to discuss problems associated with generating
a pure time-to-standard criterion in the operational setting, but I
would like to recognize that such problems do exist. It is unlikely
that individualized instruction courses presently train all students
to exactly the same standard (although some meet a 90-90 standard),
even though finishing times may vary. Until this problem is resolved,
it will be difficult to establish the exact relationship between
aptitude scores and learning rates in such courses. Ultimate solutions
may include better records and controls, continuous testing,
statistical corrections, and controlled experiments. One must admit that
the problems to be overcome are challenging.

It should be observed from Table 6 that the equations predicting
the grade and time criteria are homogeneous. This provides additional
evidence that, since tests normally have high validity for course grades,
they should also be found to be highly related to learning time criteria.
It is important, however, that direct relationships be established. The
author would appreciate receiving copies of any studies bearing on the
question.

Prediction of Decay Rates

A second stream of research which needs to be initiated concerns
the ability of aptitude tests to predict decay rates for skills and
knowledges. There has been a great deal of research leading to the
development of generalized curves of retention, but surprisingly little
research has been accomplished relating to individual differences in
retention. Underwood published one summary paper (Underwood, 1954)
in which he concludes that, when associative strength is held constant,
there are no differences in forgetting rates as a function of aptitude
during the first 24 hours. However, this study dealt with laboratory
associative learning experiments and short decay periods. The military
services should be able to provide more definitive answers concerning
individual differences in forgetting rate as a function of aptitude.

One very revealing study was reported by the Naval Personnel and
Training Research Laboratory in 1970 (Johnson, 1970) which provided
data relating to the skill decay question. The study was based on
material being taught in the first phase of the avionics fundamentals
course. Proficiency was measured by means of the criterion referenced
tests that had been used to validate the programmed instructional
material used in this phase. Measures were obtained on a pre-test, on
an immediate post-test, and at intervals of 1 day, 7 days, 25 days, and
96 days following the original learning. It was found that in spite of
a fairly high level of mastery on the immediate post-tests and a
considerable amount of review, much of the material learned during
the first phase of the course was forgotten by the end of the course.
The differences between individual students were large on the pre-test,
were quite small on the immediate post-test, and increased gradually
over the remaining post-tests until, by the end of the course, they
were almost as large as they were on the pre-test.

Although this study was based on only a fairly small N, it did
provide a set of relatively unique data. The experiment began with 138
students. Seven were dropped for administrative reasons; 8 failed
because of slow progress; 21 washed back because of slow progress; and
17 were moved ahead because of fast progress. Thus, only 85 cases were
left in the final sample, and these cases were fairly homogeneous in
terms of learning rate. In spite of this homogenization process, data
in the study can be re-analyzed to reflect differential decay rates as
a function of aptitude. As can be seen in Table 7, aptitude scores
account for 24% of the final post-test score variance, with original
post-test scores held constant (partial multiple R²). Although one might
argue that associative strength was not held constant, from a practical
standpoint it can be stated that individuals showed differential decay
rates in criterion referenced test scores as a function of their
aptitude levels.

Table 7. Retention of Electronics Fundamentals
as a Function of Aptitude

                                                Validities for Final
                                                     Post-Test

    Predictors                                      R²        R

    Immediate Post-Test                            .185     .430

    Aptitude Tests Alone                           .312     .559

    Immediate Post-Test Plus Aptitude              .382     .618

    Unique Contribution of Aptitude Tests          .197     .444

    Aptitude Tests with Immediate Post-Test
    Scores Held Constant (Partial R² and R)        .242     .492
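The last two rows of Table 7 follow arithmetically from the first and third rows. The short sketch below reproduces them from the tabled R² values; the variable names are illustrative, and the formulas are the usual ones for an increment in R² and a squared partial multiple correlation.

```python
import math

# R-squared values taken from Table 7.
r2_post_test = 0.185   # immediate post-test alone
r2_combined = 0.382    # immediate post-test plus aptitude tests

# Unique contribution of aptitude: the gain in R-squared when aptitude
# tests are added to the immediate post-test.
unique_r2 = r2_combined - r2_post_test

# Squared partial multiple correlation: the unique contribution expressed
# as a share of the variance the post-test left unexplained.
partial_r2 = unique_r2 / (1.0 - r2_post_test)

unique_r = math.sqrt(unique_r2)
partial_r = math.sqrt(partial_r2)
```

Rounded to three places these come to .197 and .242, with square roots of .444 and .492, which match the tabled entries and the 24% figure cited in the text.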




Predicting Time for Reacquisition of Skills

The third area which needs to be addressed concerns the time
required for reacquisition of skills and knowledges which have
degenerated over time as a function of disuse. One would hypothesize
that if aptitude scores predict the speed of skills acquisition, they
should also predict the speed of skills reacquisition; but, to my
knowledge, this has not been firmly established in the military setting.
I conducted one analysis in the early 1950's which I now wish I had
documented, since it bears on the question. A number of World War II
pilots were recalled during the Korean conflict, and sent to flight
instructors' school. At the school, they were given training to
re-establish their flying skills. I managed to locate the original
World War II pilot aptitude scores for a sample of these individuals
and found, to my amazement, that they were still predictive of flying
proficiency grades for students in this course, in spite of the passage
of time and in spite of the original screening, training, and differential
experiences these individuals had during and subsequent to World War II.

The question concerning the relationship between aptitude and the
time required for skills maintenance is extremely important. For
example, consider the pilot area alone, where the Air Force spends
hundreds of millions of dollars per annum in terms of fuel, aircraft,
and maintenance costs in order to maintain flying proficiency. In the
foreseeable future, multi-millions of dollars will be spent for
sophisticated simulators in hopes of saving fuel and aircraft associated
with this expensive but necessary program. Yet, we know very little
about the rates of skill decay and regeneration, and practically nothing
concerning individual differences in such rates. Are individuals who
quickly attain pilot skills also those who slowly lose such skills and
quickly regain them after decay? If so, proper selection of individuals
into the pilot training program may be more important than generally
recognized. Because of the large numbers involved, the potential
savings might be even larger on the enlisted side, although they may be
more difficult to document.

Summary

I realize that I have wandered far and wide in this rather loosely
organized paper, but I will try to summarize briefly. I have suggested
that we should begin moving away from job performance, pass/fail
criteria, and school grade criteria for aptitude test evaluation.
Certain types of perceptual-psychomotor tests and tests of
speed and accuracy may predict performance in operator and clerical
type jobs; however, we should not expect tests to have predictive
efficiency for performance in jobs where performance is primarily a
function of the extent to which fully developed skills are applied.
Test scores do predict the relative probability of failure within
training groups, but they do not determine failure rates for groups as
a whole. Pass/fail rates are determined by administrative actions,
rather than quality of input. Test scores predict training grades,
but grade points cannot be easily translated into dollars and manpower.

I have suggested that we should demonstrate the value of tests in
terms of their ability to predict personnel time requirements for skills
acquisition and maintenance.



Finally, I have enumerated some of the research findings to date
which bear upon critical issues, and have suggested research studies
which should be undertaken.

I am personally convinced that aptitude tests are indispensable in
the military setting and that they must continue to be utilized in spite
of problems which may exist with respect to test fairness. I have faith
that ways will be found to eliminate or reduce test biases which may
exist. At the same time, I feel that we have an obligation to
demonstrate the value of tests in terms of their ability to help us
operate our military establishment in a cost-effective manner.

What is the value of aptitude tests? I cannot give a precise answer
to this question; but they are of considerably more value than most
military managers have been led to believe.

SUMMARY AND CONCLUSIONS

Editor's note: The panel of invited experts were asked to comment on
the specific papers, presented here under "Consultant Comments," and to
provide closing summaries, included under "Summary Statements."
Additionally, since these comments were off-hand and verbal, each
consultant was later invited to prepare and submit a more formal paper
giving his impressions of the symposium. Those papers received in
response to this invitation are published together in the last section,
entitled "Impressions."



CONSULTANT COMMENTS

Dr. R. Campbell: I was interested in the discussion of the combined
ipsative and normative approach to rating and I was curious as
to the projected purpose.

Dr. Mullins: Well, the primary purpose is to reduce the inflation of
means and to increase the variance. You have to get it. Whether
this variance is meaningful variance we won't know until we try.



Dr. Brokaw: The problem is that we're trying to determine whether the
selection and classification variables we've been using are
appropriate for that task.

Dr. R. Campbell: Okay, you can see other uses for such a measure, but
if it's restricted to that I guess it helps clarify it for me.
But I think the work of Mike Beer at Corning Glass was interesting
in this regard. Are you familiar with what he's done?



Dr. Mullins: No.

Dr. R. Campbell: It's not published yet.

Dr. Mullins: Maybe that's why I'm not familiar with it.

Dr. R. Campbell: He's spoken about it someplace where I happened to be
and it will be published soon (Personnel Psychology). He started
out with an ipsative approach and his purpose was multifaceted;
it was not only focused on validation (I'm not even sure he had
that in mind), but he ran into the same problem. He needed an anchor
because management rejected the ipsative approach. It didn't tell
them enough for administrative matters. His anchor turned out to
be an overall rating of performance. The whole anchoring issue
raises real questions about the utility of the ipsative approach
and whether or not it's really going to yield anything. I find
the most attractive aspect of the ipsative approach to be for
feedback to individuals on a diagnostic basis about their
performance. Beyond that, I have difficulty seeing how it will be
very helpful, particularly when you seem to be moving in the
direction of away from a number of dimensions.

Maj Sellman: I have just a straightforward descriptive question on the
number of people who have talked about doing work on job performance
measurement via simulation as a real thing, that sort of thing.
I was wondering if you could, from the various branches, give some
estimate of how many lives that's really touched, that is how many
people to whom it has been applied, and just how widespread is it.

Col Ratliff: Mr. Camm has gone.

Dr. Muckler: I could give you some fourth-hand information from a paper
presented at AERA in April on that, and they were talking about how
they implemented it. If I can remember right, I think they had a
sample of 150 in each of two divisions over in Germany, and it was
on an experimental basis but from what I heard in New York that was
the extent of it at that point, the tryout over there. They'd sent
a rather large number of researchers over to Germany to do it. I
don't know how wide it has gone beyond that, but I know they are going
to follow it up quite a bit.



Maj Sellman: Is there anybody in the Navy who has to go through
simulation training?

A: Where simulation is used as a measure of performance, I'm sure
50,000 people a year in the Navy are subjected to this.

Q: How many different jobs does that encompass?

A: 50,000.

Q: Is that done during training, post-training, or both?

A: Both. Post-training use of simulation and associated job performance
measurement within the Navy is increasing constantly. If you ask
me how well we're doing it, I would prefer not to answer that.

Dr. Muckler: If you don't mind, I'd like to stick a summary comment in
at this point and come back later. I've been somewhat bothered
by the frequent reference to the expense and the impracticality of
work samples, simulations, and the like. I would like to point out
to somebody in the Air Force (and I have a feeling that the people
I would like to point this out to are not here), that the price of
one B-1 bomber would be more than adequate to do an enormous amount
of work on the development of practical, useful work samples. I
would also like to point out, and this time, I think, to the people
that are here, that there has been one area of confusion in the
discussions here. That is that there has been almost intermingled
discussion of performance measurement as research criteria and
performance measurement for operational purposes. If you're concerned
about a criterion measure, you're concerned about research work,
and I do not believe that it is necessary or even desirable to use
operational measures of performance as research criteria. The
practicality of the work sample approach to performance measurement
ought not get confused between the practicality of its use as a
research tool and the practicality of its widespread use throughout
the services as an operational tool. I was kind of startled (I'm
going to quote the sentence in Mr. Foley's paper) when he made
the statement: "At the present time, throughout his whole career, a
maintenance specialist is not required to demonstrate on formal
job task performance tests that he can perform efficiently and
effectively the tasks of his job." I think this is shameful,
because I'm reading more into it than was actually said and I think
I'm justified in doing it if the Air Force is anything like any
other organization I've ever worked in. And if there are no formal
job performance tests that an individual ever has to demonstrate
proficiency on throughout his entire career, there probably is no
systematic means of evaluating that performance either. We live
in a society that worships hardware, that puts all of its faith in
hardware, and that pays very little attention to the cost of the
human organism that built the hardware, maintains the hardware,
and operates the hardware. And until we get the notion that it is
not practical to build all that hardware without giving some
attention to the people that use it and do something with it, we
really aren't going to be talking about anything very practical.
End of sermon.

Dr. Brokaw: He's not here to defend himself, so I can pick on Ray
Christal a little bit. If I can read my notes I can pick on him.
He identifies speed as the all-purpose criterion and level as the
all-purpose predictor. Now that suggests that a lot of people are
wasting a lot of effort in a lot of places. I would like some
individual and consensus response to this concept. Do you think
that this could be an artifact because he worked in groups which
are already separated in terms of classification? He looked at
mechanical people in the context of other mechanical people, he
looked at electronics people in the context of other electronics
people. He has not yet looked at these people in competition with
each other. Did I put everybody to sleep?

Dr. Hutchinson: I'd be glad to respond but not to that question.


Dr. Guion: It seems to me that--this is going to be on the tape so
Ray can hear it, isn't it? Okay Ray, here we go. It seems to me
that what he's done--what you have done, Ray--is to move back to
World War I when we got all those beautiful charts that were
reproduced in every elementary psychology textbook for a period of a
generation or more showing the mean and standard deviation of
AGCT scores for various occupational groups. I've always found
that diagram to be one of the more interesting and useless diagrams
in elementary psychology textbooks. Students spend a great deal
of time poring over it trying to decide which occupation has the
intellectual prestige to which they aspire, but I have never found
any practical usefulness for it in a nonmilitary setting. If you
go as far as Ray went and identify the crucial problem for military
services as being a placement or classification problem rather than a
selection problem, I think that the oversimplicity of this model
becomes so obvious that it no longer has any interest. Wish you
were here, Ray.

Dr. J. Campbell: I would add a brief comment to that. I think Bob was
saying appropriate things about the aptitude distributions for
different occupations. However, I think that is a separate issue
from whether the time it takes to reach an acceptable level of
job proficiency is a useful criterion for selection and
classification research.

Dr. Guion: I'm only talking about the general level as the general
predictor.



Dr. Helmick: I would like to use this opportunity to raise a general
question and apply it to this particular situation. It seems to me
that one of the things that I saw getting lost in the discussion
over the 2 days was the distinction that Dr. Muckler tried to make
between measurement and criterion and the concept of the judgmental
aspect that goes into what I would agree is the real, true aspect
of the criterion. It seemed to me that the speed determination,
as I understood it to be described, was essentially another
measurement and really had nothing to do with the definition of
the criterion. And I think it's a quite appropriate way under
certain circumstances to measure the criterion. It may very well
in many cases be a better way. Where you have mastery criteria,
speed may very well be the only alternative. But that doesn't
answer the basic question of speed to do what. How did you decide
to measure the speed to acquire this particular kind of performance?
It seems to me that a great deal of the discussion this morning
as well as yesterday was concerned with measurement problems.
I'm certainly not averse to that. Measurement problems are very
real. But I think sometimes we stay in our difficulties because
while we do refine the measurements, we still may not be
measuring what we would like to if we stopped to think about it.

Dr. McCormick: Perhaps in defense of Ray Christal in his absence here,
I would like to say that I believe that his position regarding
"level" requirements for jobs does have a fair amount of validity
to it. In other words, I think there is some tendency for people
to gravitate into the kinds of jobs which are commensurate with
their own levels of ability. Those persons who have that which
it takes to perform a particular job may well perform at a
different level on some test or other measurement instrument than
persons on other jobs. I think in some of our research we have
some evidence to support this. The assumption that people
generally gravitate into jobs that are commensurate with their
own levels of abilities is not a completely valid one, but at the
same time I think that there is enough substance to this notion
to support Ray's point that "level" of performance on various
kinds of tests may be a reasonable criterion for the selection or
placement of people on the jobs in question. With respect to the
matter of "time to learn" jobs that he discussed, I think
basically the notion of time does make a certain amount of sense,
although it does not completely avoid the business of making some
kind of determination about the level of proficiency. In other
words, to determine the time required to achieve a certain
level of proficiency one still has to make a determination as to
the level of proficiency that you are talking about, so you do
not completely avoid the business of evaluation, rating, or
performance measurement or what not by the use of time. In
connection with this matter of time, Stanley Lippert (whom some
of you people may know) recently turned out a very thorough
analysis of learning curves in which he has found some generalizable
curves in which he has incorporated provision for measurement of
the level at which a person begins learning whatever it is to be
learned. On the basis of his evidence, I think that if time is
used as a criterion, there should be some provision for incorporating
a measure, at the initiation, of the performance level at which
the person begins the training in question.
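
[Editor's illustration, not part of the recorded discussion.] The point
made above, that a time criterion still depends on a judged proficiency
level and on where the learner starts, can be sketched in a few lines.
This is a minimal sketch under stated assumptions: the function names,
the trial scores, and the proficiency cutoffs are all hypothetical and
are not taken from the work of Christal or Lippert.

```python
from typing import Optional, Sequence


def trials_to_criterion(scores: Sequence[float], criterion: float) -> Optional[int]:
    """Return the 1-based trial on which the score first reaches the
    criterion level, or None if it is never reached."""
    for trial, score in enumerate(scores, start=1):
        if score >= criterion:
            return trial
    return None


def summarize(scores: Sequence[float], criterion: float) -> dict:
    """Report the time criterion together with the starting level,
    so that learners who began at different levels can be compared."""
    return {
        "initial_level": scores[0],
        "trials_to_criterion": trials_to_criterion(scores, criterion),
    }


# Hypothetical proficiency scores over successive training trials.
novice = [20, 35, 50, 62, 71, 80, 86]
experienced = [55, 66, 75, 82, 88]

# The "time" obtained changes when the judged cutoff changes.
print(summarize(novice, criterion=80))
print(summarize(novice, criterion=70))
print(summarize(experienced, criterion=80))
```

Changing the cutoff from 80 to 70 changes the time score for the same
learner, which is the sense in which the judgment about the proficiency
level is not avoided by measuring time.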

Dr. J. Campbell: I don't know if I can add anything to what's been said,
but it seems quite reasonable to expect that as the military
services move toward more self-paced training, some good criterion
measures to consider would be the time to training completion and
the time to reach job proficiency. Another useful criterion might
be the amount of decay in job skills after a certain amount of time.
However, none of these gets one out of the bind of having to measure
performance itself. Without measures of job performance, and a good
definition of what constitutes an adequate performance level, it
would not be possible to determine the time it takes an individual
to reach "adequate performance." Thus the development of a
criterion based on time will be more, not less, complicated than
the usual kind of performance assessment. However, I'm sure this
is not news to Dr. Christal and that he well realizes the
difficulties involved. I think his argument is that, in spite of the
difficulties, time is a very valuable criterion for military
organizations. I also think he is right. However, perhaps
without meaning to, he rather quickly slid over the problems that will
be involved in rating the time demands for various job tasks. It
won't be easy, and it adds another rating task to whatever is
already required of whatever sample of raters is available.


On the question of how to classify or place individuals in
different Air Force jobs, I don't think I was able to fully understand
what was said and thus should not comment on it. Nevertheless, I
think some of us inferred that he was advocating a return to job
placement via differential score levels on one dimension of
overall ability. However, I don't think he would take such an
extreme position. We didn't hear correctly.

Another aspect of the general problem that seems missing from the
discussion so far is that some job tasks are more "critical" than
others, and predicting the time to learn the critical tasks would
be more important than predicting the time to learn the less
critical tasks. Another feature of the criticalness of tasks,
which was recognized in a study of Navy enlisted personnel by
Glickman and Vallance, is that there are often a finite number
of identifiable ways that people fail at a job. That is, it is
often possible to describe, in concrete behavioral terms, the
most important mistakes that people make. If the objective is
to select people who will minimize such mistakes then perhaps the
most appropriate criterion is not the time it takes to perform
the tasks adequately but the absolute level of proficiency with
which the individual can learn to perform the task given a
reasonable amount of time.

Dr. Brokaw: We've had a lot of discussions of ratings. We've talked
about ipsative ratings, we've talked about normative ratings, and
we've talked about doing away with ratings in favor of performance
tasks, and yet we seem almost always to come back to look at them
again. I would like for you gentlemen to tell us whether we should
go our merry way with ad hoc ratings as they seem to be appropriate
or should we spend some time on attempting to develop some
specialized rating kind of processes whereby we either train raters
to levels of proficiency, or we identify raters who have success
in the skill of rating objectively, or, what should we do about
this rating problem. Should we assume that all the problems are
answered, or should we pursue our research in that domain?

Dr. R. Campbell: I can give you a brief answer to that as I think
ratings will be with us throughout my lifetime; however, I was
encouraged by the emphasis on proficiency measurement (as
distinguished from performance measurement) and I applaud that
work. If you've got proficiency measures for 50,000 jobs, I
think that's marvelous. We substitute them for proficiency
ratings whenever possible in my organization. The fact is, though,
we will need ratings for other purposes. Now I certainly hope
we would not use ad hoc ratings. Somebody here said we shouldn't
use them, I think, maybe several people did. I'm not very big on
"rater accuracy" as the way to go. Frankly, it's an unfruitful
way. I prefer improving rating conditions, and the training of
raters, particularly if we're using these ratings in research
situations. I think much can be done to make the ratings better.

Dr. Helmick: I would certainly agree. I think that from all of that
I was encouraged by the attention that's being given to improving
ratings, although I do not disagree that any time we can find a
better measurement than a rating we ought to use it. I guess the
only specific point I would raise in connection with the report on
some of the work being done would be the emphasis, as I understood
it, on trying to validate ratings against paper-and-pencil tests.
Coming from one of the largest suppliers of paper-and-pencil tests
in the world I certainly have no objections to them, but I have
the feeling that modifying the rating procedure to produce results
more like the paper-and-pencil tests would not necessarily be an
advancement in approaching the truth. The kinds of things that
can be effectively measured by paper-and-pencil tests may be less
important than those for which ratings may be the only means
available.

Dr. McCormick: I think there are two kinds of circumstances under which
ratings will continue to be used. In the first place, there are
certain kinds of job activities which by their nature I believe
can best be evaluated on the basis of subjective judgments of
other people. As an example, in the case of behaviors of an
interpersonal nature, human judgments about such activities might be
better than any other kind of measure. In the second place,
ratings will, of course, have to continue to be used in the case
of things that theoretically at least can be measured objectively
but that we have not been bright enough to figure out how to
measure. Now, as to what we call "ratings," I prefer
really to think of them as human responses which people are
required to make. In the case of conventional ratings the rater
is asked to make absolute judgments, as contrasted with the making
of relative judgments, when we talk about what I sometimes call
personnel comparison systems (like rank order, forced distribution,
paired comparison, etc.). I am in accord with Ray in his talk
about the use of relative ratings. I think the notion of ipsative
ratings also falls into this ballpark too. I think that the use
of relative ratings can get around some of the problems of
inflation and bunching up. However, there are other kinds of
"rating procedures" that do not require the making of judgments
or evaluations, but rather that require descriptions of behavior.
I am thinking here of various types of scales and checklists such
as behavioral expectation scales, the forced choice checklist,
etc., where the "rater" is asked more to describe someone's
behavior, rather than to judge or evaluate. I would heartily
endorse any efforts to make comparisons of the effectiveness of
these different kinds of human responses, both in terms of their
psychometric properties and also in terms of their practical
utility in connection with the whole matter of criterion
development.

Dr. J. Campbell: In general, I guess one could say that any research
on ratings is valuable. However, there are certain kinds of
research that make me more nervous than others. It seems to me
that research efforts devoted to discerning the value of different
scoring procedures, different formats, different transformations,
etc., are not really the direction to take. The historical record
here is clear. These kinds of variables don't seem to make very
much difference in the reliability and predictability of ratings.
If you want to choose the one thing that makes the most
difference, I think it is the motivational contingencies under which
the rater operates. Also, I don't think the suggestion to be
more descriptive than evaluative will help much. People (raters)
know the purpose for which the ratings are being made and they
know the rewards and punishments that are contingent on their
behavior as raters. These motivational concerns are a significant
influence on how they use the rating instrument. If we don't
deal with these concerns then we don't deal with one of the major
problems involved in the evaluation of one person by another. In
my opinion, a second major determinant of reliability and accuracy
in ratings is how well the raters understand the content of the
behavior to be rated. Back a decade or so ago when Smith and
Kendall stimulated our interest in the method of Behavior
Expectation Scaling, people showed an interest in this technique
for one of two major reasons. Some saw it as a way for sampling
and describing job behaviors in a more complete and meaningful
way than has ever been done before. For others it was a new way
of dealing with the traditional problems of unreliability, halo
error, leniency, etc. I think research on the BES method got on
the wrong track early by emphasizing the latter and not the former
objective. People should worry more about the "goodness" of the
description of job behaviors to be rated and not so much about
halo or leniency. In sum, I want to assert that two major areas
of needed research are the motivational considerations influencing
rater behavior (for research as well as operational ratings) and
ways in which domains of critical job behaviors can be better and
more usefully described for the raters.


Dr. Brokaw: Could the members of the panel comment on types of rater
training programs they might have encountered, like with the police
department? Do you ever come across programs where they literally
train you or somehow try to get the rater to make more accurate,
valid, reliable, or useful ratings, more meaningful ratings?

Dr. Guion: Let me stick that into the more general comment that I want
to make. I was going to let Mac speak for me on the rating issue
until John started to confuse the operational ratings with the
research ratings. And the thing that I think has to be recognized
with regard to the research ratings is that even when you take the
punishments and the rewards out of the thing and you tell the
raters that what they're really doing is making it possible for
them to get better people in the future or something of this sort,
and that nobody's going to get hurt or helped by their ratings in
this particular set of ratings, they still can't do it. And this
is true when we've given video tapes and training programs to
them in a wide variety of different kinds of efforts. We have
used films of actual police calls, for example, and gone through
a great deal of intensive effort to get people to observe,
describe, evaluate, agree on the meanings of anchor terms, this
sort of thing; we still end up with many raters giving us terribly
unreliable ratings even in that wholly laboratory situation where
there isn't even the reward system of the research ratings.

I think that one of the things you have to recognize in responding
to the question that was originally raised is not merely that
ratings will always be with us, but that they are ubiquitous. I
think that we would do better if we stopped using the term "rating"
and used the more general term of judgment instead. We would
recognize then that all the rating systems that we use as criterion
measures, whether they are ratings per se or ratings of product or
process in a work sample, or the evaluations that are made when
someone is given a trial period of performance on a job, such as
a probationary period, whatever the context in which the criterion
measure exists, the rating is simply a tool for obtaining judgments.

The paper by Uhlaner, Drucker, and Camm has one interesting
statement in it that would be interesting to question them about to see
if they really meant it quite as it sounds. It offers the hypothesis
that ratings are more likely to be "accurate" in those situations
where some kind of interpersonal activity is involved. That's an
interesting hypothesis. If this is true, then we should be using
not only the whole process of judgment and perception research in
our research on ratings, but we should be specializing perhaps on
social judgment theory with all of the lens model implications,
policy capturing implications, that this sort of thing has.

I guess the answer that I will have to give to your question, Lee,
is that on the way down here I was reading these papers along
with the first draft of two theses, one of
which I've already told you about; it was the prediction of rating
accuracy study that I mentioned. The other one was an interview
study where we tried to determine the effect of nonverbal cues
on interviewers' judgments, which was a rather devastating kind of
non-finding when we got all through with it. This, coupled with
the fact that I happen to be at a university that has been
specializing, in the person of one of its faculty members, in social
judgment theory for the last (how many years, Jack, 8? 10?
something like that!), I have made a vow to have a mid-career change
and devote most of my attention over the next few years to the whole
process of judgment in the evaluation of anything, performance,
product, consequences of behavior, whatever you like--because I
think that most of our criterion measures ultimately involve the
process of judgment. Certainly they involve the process of
judgment if we make the distinction that was urged upon us
yesterday between a measuring instrument and the value judgment
that turns the measuring instrument into a criterion. How
these judgments are arrived at, the multiplicity of policies in
arriving at those judgments, all of these are of crucial importance
if we're going to evaluate criterion measures. And I don't think
that we can simply walk away from ratings even as we walk toward
job samples, simulations, and that kind of thing.



Dr. J. Campbell: Although this notion is not original with me, I think
there is a law of nature that says objective measures are really
subjective measures, at least one step removed. Behind every
objective measure one can turn up personal judgment somewhere, and
all the problems inherent in making such judgments come home to
rest. That is why we all should be very concerned with problems
of perception. Person perception research in social psychology,
for example, has built up a huge literature on a lot of trivial
things but also a lot of things which are very relevant for this
situation. To mention just one, there is a large literature
concerned with the influence of stereotypes on judgments. I can
recall a study in the organizational literature by Wayne Kirschner
which discovered, fairly clearly, I think, that if you took two
kinds of supervisors, those who were judged to be good supervisors
and those who were judged to be bad supervisors, they had a very
different stereotype of what a good employee was. As a result,
one might expect them to rate different people highly or the same
people differently. The person-perception literature is a big
area and to be a well integrated investigator of problems of
personal judgment (e.g., performance ratings), you must jump into
it at some time or other.

Dr. Brokaw: Does anyone in the audience have a question?


Sgt. Winn: I've got a question. I'd like a quick summary of what the
two different kinds of supervisors thought was a good employee.

Dr. J. Campbell: Well, a "quick" summary is that for the good
supervisors, their stereotype said that a good employee was a little
mavericky, a hard driver, a bit of a nonconformist, etc., whereas
the poor supervisors' stereotype said that a good employee was
docile, didn't make too many waves, etc. I'm overstating the case
a bit, but the descriptions were of that nature.

Dr. Guion: Our studies are of trained versus untrained raters, but I
think in most of them the training has to do with familiarizing
people with what is halo, what is leniency, and where are all these
so-called psychometric errors. And it's very short. I have a
paper here by one of Bowling Green's ex-students which compared
trained versus untrained raters, where they were training, and this
sort of thing. But I think the training that's really crucial is
training in what are you going to observe, what are you going to
call high, what are you going to call low, etc., and it's taking a
long time. I mean we really train them.

Maj Sellman: In Flying Training, we have kind of a unique problem and
that is to observe 10 minutes of behavior costs several thousands
of dollars if you fly an airplane, so you want to make sure that
whatever rater you have is the best possible rater, and pilots tend
to be very good raters to start with. They really know what they're
doing. But it's just a very difficult situation, and this is in a
research area mostly. I'm not sure exactly what happens out in
the field, I mean just out there doing it.

Dr. Brokaw: There was a question Bob raised a while ago of how to get
two people to agree that they've seen a specific behavior. It's
no small problem.


Capt Curton: I just wanted to ask Dr. Campbell from AT&T if he would
comment on the types of instruments they use to validate their
assessment centers and promotions that result from the assessment.
What types of criterion do you use in that situation?

Dr. R. Campbell: We have used several different types. One is
advancement in the organization. The assessment centers are designed
usually to show potential for advancement or potential for certain
lines of work, and the criterion we have used most frequently is
actual advancement in cases where the assessment center data was
not fed back to the organization. So that's the most common one.
Another is to set up special judgment situations. I can think
of a sales example where we were trying to validate an assessment
program for salesmen. There is a prescribed procedure for
opening a sale, how you close a sale, how you do usage prospecting,
getting information and so on, and there is a trained set of raters
who normally go around the country doing evaluations of people on
the job. In that case we used a research judgmental procedure. Another
approach that's been used with some success, at least in terms of
showing validity, is instead of using the ratings that are in the
files, administrative ratings, we have trained interviewers go out
and talk to the supervisors who report on the behavior of the
incumbents, and then the interviewer makes the judgment about where
somebody falls on a certain dimension. Those are the main three--
we've used salary progression but we avoid using ratings that are
"available."

Col Ratliff: Dr. Campbell, in your assessment center where you use
different people as part of your assessment process, do you think
that participating in the assessment process makes a difference on
their later ability to assess people?

Dr. R. Campbell: I believe so but I have no evidence to substantiate
that. In staffing an assessment center, we do not rely on
selection of good raters. We do provide training, explain the
dimensions to be observed, and train them in observation and
judgment. Perhaps the best training that they get in the whole
thing is that they participate as a team of raters in which they
have constant feedback on the judgments and observations that
they're making. The problem when they get back to the field is
they may be better trained judges, I believe they certainly
are; now the rating conditions are different. Now they don't have
the full observation, they're rating people on standard tasks,
so whether or not their judgments are in fact better after they get
in the field I really can't say, although we think they're better
trained raters.

Col Ratliff: You mean they're more conceited about their judgments
when they get back.

Dr. R. Campbell: No. It's one thing to be trained in what kind of
behavior you should be observing, what it applies to, and what the
anchors are, but you've got to somehow set up the rating conditions
so that you see the behaviors when you get out there. And you
don't have that same control on the everyday field study that you
have in an assessment center where everybody has gone through the
same tasks. How much that impacts the "validity" of the judgments
that they're making in the field I'm not sure.



Maj Waters: I'd like to just sort of comment. One of our divisions
that's not represented here, our Flying Training Division, is doing
a concerted effort in the performance evaluation area in the flying
game, and they're specifically looking at automated performance
measurement in the aircraft and pilot tasks. Since Jack was pretty
much involved in that I just thought of it when Dr. Campbell
mentioned subjective measures. I think there is one case where
there probably isn't any subjectivity in the measurement procedure,
but there may be questions about validity of the data that you're
collecting. Jack, I don't know if there's anything you want to
say about that, but...

Capt Thorpe: I kind of disagree. The reason for that is that we develop
a lot of interaction between the computer and flying in the
simulator and you can get a guy in there and the students can thrash
around and you can collect 35 parameters 20 times a second and come
out with reams of data. If you're skillful you can reduce that to
even one number like 25% of the time he did well or something, but
then when you go to the back room to find out how all these measures
were devised, there's our pilot back there with one of our skilled
programmers, and he's figuring out what measures to measure. And I
think it's pretty much the same judgment. It's his judgment of what
he thinks are the things that should be measured, so maybe we
measure them more reliably or more accurately or in a time domain
that's much more specifiable. We measure perturbations that
normally we might not be able to write down fast enough as they
happen, but what it all boils down to is there's a great deal of
subjectiveness in objectivity.
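
[Editor's illustration, not part of the recorded discussion.] The
reduction described above, from a stream of sampled parameters to a
single figure such as "percent of the time he did well," might look
something like the sketch below. It is a minimal sketch under stated
assumptions: the altitude readings, the target value, and the tolerance
bands are invented for illustration, and nothing here describes the
actual simulator software.

```python
from typing import Sequence


def percent_in_tolerance(samples: Sequence[float], target: float,
                         tolerance: float) -> float:
    """Percent of samples lying within +/- tolerance of the target.

    The computation is mechanical; the choice of which parameter to
    score, what the target is, and how wide the band should be is
    supplied by a person.
    """
    if not samples:
        raise ValueError("no samples to score")
    within = sum(1 for s in samples if abs(s - target) <= tolerance)
    return 100.0 * within / len(samples)


# Hypothetical altitude readings (feet) from one short segment.
altitude = [1000, 1010, 1040, 1060, 990, 950, 1005, 1100]

# Same data, two different judgments about what counts as "did well."
print(percent_in_tolerance(altitude, target=1000, tolerance=50))  # 75.0
print(percent_in_tolerance(altitude, target=1000, tolerance=20))  # 50.0
```

The same recorded data give 75 percent or 50 percent depending only on
the tolerance someone chose, which is one concrete form of the
subjectivity the speaker describes.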

Dr. Guion: Last night in the bar we were having an intellectual
discussion with Jack and he was describing what seemed to me to be
the ideal measure for the evaluation of trainee pilots. This was
brought about because of a test that I told him about that was
used by the City of Honolulu to evaluate candidates for fire truck
drivers, which involved a fellow named Martin Luke riding with the
candidate up Tantalus Road. If any of you have ever been near
Honolulu and know what Tantalus Road is, just visualize taking
that road with a full-length fire truck. The score, Martin would
say, was the number of drinks he had to have before he could write
up a report after he got back down to the valley. Jack's story was
that the real score would be measured by the pressure of the check
pilot's hands on the arms of his seat. Now all that, of course,
is barroom nonsense, and like a lot of other barroom nonsense there's
a great deal of wisdom in it. Obviously, the bar-seeking check
rider with the fire truck or the arm-gripping check pilot is making
a subjective judgment about the quality of the performance of the
person being scored with either of these. The question now becomes
one of how you record that subjective judgment, and I submit that
a 5-point rating scale is not going to be as reliable and valid a
recording of the subjective judgment as some kind of a dynamometer
on the arm of that chair. And of course what you'd have to do
is develop some sort of a personal equation, a kind of a chicken
factor, for different observers so that you could make a correction
for the timid versus the foolhardy check pilots. But the point is
that even though it is still a subjective evaluation you're getting
here, your method of recording that evaluation does not always
have to be a rating scale. And I think we ought to investigate
some other approaches to recording subjectivity. I don't see any
good reason why you can't do a little spelling with these guys that
do the check riding and get some GSR data if nothing else and use
it as a criterion measure. And I am being only maybe 10% facetious.
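
The "personal equation" described above can be illustrated with a
small sketch. It is not anything the speaker specified; it assumes
that each observer's readings (grip pressure, GSR, or whatever is
recorded) are simply standardized against that same observer's own
mean and spread, and that every observer has seen a comparable mix
of candidates. The function name and the data layout are
hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_within_observer(readings):
    """Express each raw reading relative to its own observer.

    readings: list of (observer, candidate, raw_value) tuples
              (a hypothetical layout chosen for this sketch).
    Returns a list of (observer, candidate, z) tuples, where z is the
    raw value minus that observer's mean, divided by that observer's
    standard deviation.
    """
    # Collect every raw value recorded by each observer.
    by_observer = defaultdict(list)
    for observer, _, value in readings:
        by_observer[observer].append(value)

    adjusted = []
    for observer, candidate, value in readings:
        values = by_observer[observer]
        centre = mean(values)
        spread = pstdev(values)
        # An observer whose readings never vary carries no information
        # about differences between candidates, so score those as 0.
        z = (value - centre) / spread if spread > 0 else 0.0
        adjusted.append((observer, candidate, z))
    return adjusted


if __name__ == "__main__":
    sample = [
        ("timid", "A", 80), ("timid", "B", 100),
        ("bold", "C", 20), ("bold", "D", 40),
    ]
    for row in standardize_within_observer(sample):
        print(row)
```

Under this assumption a timid observer's high readings and a
foolhardy observer's low readings are put on a common footing. If
observers see systematically different candidates, the correction
would also remove real differences in performance.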

Col Ratliff: I might point out that the Russians have a little test
called the "Falling Down Test" that they use in their pilot
selection program, with which they instrument the individual's
blood pressure and pulse and things like that, and all he has to
do is stand up straight and fall forward on the floor. And the
intensity of his physiological reaction during that period was
recorded and held against him, I presume.

Dr. Guion: I'd like to say first of all that this has been a marvelous
vacation. I've been here now nearly two days and the Federal
Courts have not really intruded themselves into the discussion at
any point. And I do not recall a 2-day period, other than when
I was on vacation, when I have been free of concern for the effects
of courts in the last year. I'm not entirely sure that you should
have given that vacation, because I think that some of the
concerns of the courts may become your concerns. And even if
they don't, one of the effects of the court involvements that
we've had over the last few years has been a re-thinking, a very
needed re-thinking, of the whole concept of employee selection,
validation procedures, etc. And I think that you could do well
to raise the same kind of question with all of the things you're
doing, namely, how would I defend this if it were challenged in
court.

I raise this question particularly with regard to this material
that has been given the unfortunate name by you people of
synthetic criteria. This is one step worse, I think, from a
semantic point of view, than synthetic validity, of which I have
been guilty. Obviously, we are not synthesizing either validity
or criteria. You already spoke this morning of the semantic
absurdity of the phrase, but it means something more than that.
It means that when we are not thinking about the Courts, but are
thinking only of our professional colleagues, we try to put
everything that we do into a framework of validity whether it
fits there or not.

Now if you look at the APA-AERA-NCME standards, you'll find that
validity comes under three guises: criterion-related validity,
construct validity, and content validity, which in a paper a
couple of months ago I said doesn't exist. I think it's down to
two. Criterion-related validity is a pretty straightforward kind
of thing except that it's not really concerned with the validity
of a test, it's concerned with the validity of a hypothesis, the
hypothesis that some measure can be predicted by some other
measure, either actually predicted over time or in a purely
statistical sense in a concurrent kind of study. Construct
validity is a very complex idea and very few of the things that
have been thrown about in court discussions of late under the
heading of construct validity have any resemblance to the kind of
Cronbach and Meehl notion of construct validity that started the
notion several years ago. The point of all this is that in a
Supreme Court decision there was a term used called "job
relatedness." I don't know who developed the term, but I like to
think of it as a legal term rather than as a psychometric term. And
when we're thinking in terms of court involvement we have to
identify arguments for convincing a court that a method of selecting
people, whether it's a test or anything else, is related to
performance on the job.

I would like to urge that all the work now being done under the
heading of synthetic criteria be done simply under the heading of
a systematic method for gathering judgments (see, I'm on that same
kick even though it started out like I was talking about something
else), a systematic attempt for obtaining judgments about job
relatedness.

Now if we're going to talk about predictive validity, I think I'd
like to point out that the proper aim of personnel research,
whether it's selection or training or whatever, is to predict and
influence future behavior or the consequences of future behavior.
The purpose of personnel research is not to evaluate instruments.
We can have a lot of fun designing studies to evaluate tests or
training methods or something of this sort, but personnel research,
even if it's sponsored by OSR, is primarily concerned with making
more proficient personnel. This is its fundamental purpose and we
can't lose sight of it.

Now, in that Army paper (and I'm using this statement, incidentally,
as an illustration, not as criticism, because I have no intention
to criticize the intent of the paper, only the language), there's a
statement that grades are used as criteria for cognitive predictors.
I think that statement illustrates the backward way that we often
think about personnel research. We have on our hands now a
cognitive predictor; therefore, let us look for grades as a good
criterion that we might be able to predict with this cognitive
predictor. That is not our business. Our business is to say (a) we
want to predict performance in training, or (b) we want to be able
to predict proficiency on a job, or (c) we want to predict how fast
people will reach some stated level of proficiency, or whatever,
and then figure out what the best way is to predict that particular
criterion. Oops, I slipped. That was the second stage. The
first stage is how to measure it. See, I'm disagreeing, Dr. Muckler,
on your sequence. I think the value judgments should precede the
measurement, not follow it as a transform. I'm trying here to make
a defensive comment about the quotation that was attributed to me
yesterday about "a criterion is simply something that we predict"
because I'm trying to give the indication that the something to
predict must be identified before we develop a measure for it. I'm
also being a little bit defensive about the comment made by
Dr. Mullins that it somehow jars to talk about the validity of a
criterion measure. In the first place, I don't think that's really
consistent with the rest of the paper, because if a criterion
really isn't different from the predictor except in the
point-of-time scale, then all of the validation that you would
do with the predictor applies to the criterion measure too. I am
not the least bit jarred by concern for the validity of a criterion
measure. I get jarred when there isn't any such concern.

I guess the only other thing I want to say in the summary here is to
reinforce the Navy's views, or at least Dr. Muckler's views,
whichever they are, on simple versus multiple criteria. I think, and
I wish that Mr. Camm were still here, I think it was rather
shocking to find that these skill qualification tests come up with a
single score. These are complex areas of performance. There is
no good reason to suspect that any one test is going to be
predictive of all areas of performance in a skill qualification
criterion, or that a job must necessarily always be done in
precisely the same sequence or same manner. And I think that we
need to move away from World War I and the implications of the
straightforward time-versus-level kind of table into a recognition
that those interactions that scared Ray so much yesterday are
quite possible, and even if they don't serve as interactions, they
may very well serve as additional main effects. I think that we
have to pay a great deal more attention to the complexities of
performance than can be carried out with some kind of single
number that is supposed to somehow represent a compensatory
summary of all of the components that go into that number.

Dr. R. Campbell: One has to be courageous to hold a 2-day meeting
on job performance evaluation and criterion problems. Most of the
papers stuck somehow to the rubric, although we did seem to cover
an awful lot of ground that wasn't focused on those two subjects.

There was some confusion in the papers, I felt, over the term
"job performance," and I think that needs clarification. For
my money, job performance refers to on-line behavior and output:
what the person is actually doing on the job and producing. And
I'd like to keep that pretty clean; that's what I mean by job
performance.

Criterion can mean all manner of things and is not so specifically
defined in my taxonomy. There was much discussion of purpose, and
how important it is that we keep purpose in mind in selecting
a criterion. I want to echo that statement and emphasize that we
ought to keep these multiple purposes in mind when we're selecting
criteria and when we're evaluating what we're doing.

It is essential we be explicit about the distinction between
proficiency and performance. Most of the predictors that we've
been talking about, most of the selection instruments, the last
2 days deal with proficiency, which hopefully will be related to a
person's output and behavior on the job. Of course it's not hard
at all to conceive of highly proficient people being not very
good job performers. I think we kind of butted that along the way.
If you bring me into a special setting where I have to be able to
perform some maintenance function, it may be that I can perform that
maintenance function better than most everybody else you can bring
in, but you put me out on a job and I don't do very well. Okay,
it's the old saw, it's what a person can do versus what they
actually do on the job. Now if there's a disparity between the
two, what a person can do and what they actually do on the job,
I don't tie all that to motivation. We ought to look at management
practices, which someone around here did mention. There was some
agonizing that you haven't increased your validity coefficients in
the last 10 years. I don't consider that an indictment. Maybe it's
a function of what you're trying to predict. Perhaps you're
somewhere near the maximum level of prediction just looking at
selection instruments. And perhaps the focus must be broadened
beyond selection.

Another purpose for validation or for criterion selection running
through the papers was the acceptability of the criterion, which
translated to the ability to sell our work. Acceptability is
important in selecting a criterion and we must consider the
user. However, I just raise a caution that we don't let the selling
determine the research and lose our way in the process.
And there are other purposes. What I'm trying to say is that while
we recognize the importance of purpose, I'm not sure we always
explicitly deal with purpose and let it guide what we're doing.

There were, for me, some other fuzzy definitional issues, but
they've already been discussed. I come down strongly on the side
of "the criterion problem is not just a measurement problem." It
certainly involves values and judgment, and I just want to second
that. I also liked the comment we heard that we get overly upset
with complicated problems, and we ought to recognize that we're
dealing with a very complicated area.

There was a statement at the outset about whether there is a glorious
solution, and I can confidently say "No," because I don't think
anybody's going to get one very soon. There is no
solution. I think you work, work very hard, at devising the best
criterion feasible in a given situation. An example of this would
be Christal's switching to a time criterion. There is no one
solution to criterion problems. You react to the realities and
complexities of the situation. I want to cite several things I
particularly liked in the papers. One was the discussion of levels
of criteria, from individual levels up through [illegible] levels. I
liked the emphasis on measures of proficiency and [illegible]. I don't
know these systems very well; I fully support the intent of them:
the real training, the symbolic performance testing, and the skill
qualification testing. These kinds of proficiency measures
should be very useful.

The major omission I noticed in the program was the failure to
deal with job performance, the outputs and behavior of the person
on the job. That is, aside from ratings of performance. I know
this is a very difficult problem, it's a very expensive problem
sometimes to try to fix, but, again, if we're looking for criteria,
particularly for research purposes, I was hopeful that I would have
heard more in the way of conceptualizing how one might get
performance measures and the methodology that might be used. And I'm
not so pessimistic as to say it cannot be done. I wish I had heard
more about that.

Dr. Helmick: Well, I certainly won't be sufficiently presumptive to give
an indication I'm going to summarize everything that happened. I'll
say that I think I've agreed with the summaries to date, and I'm
sure I'll agree with those to come. I have down here a note from
this morning's presentation by Col Ratliff that certainly ties in
with what Bob Guion said, and I marked it important, underlined,
"Improved way of making judgments." I think that really gets to
the heart of much of what we need to be dealing with.

I don't know that I am disagreeing really with Dr. Campbell on
making the sponsor or the client happy. It's pretty clear that
one can overstress that. I think on the other hand from my own
experience in applied work and from earlier military experience,
it can certainly amount to an awful lot of wheel spinning if you
don't have some agreement or some understanding as to what it is
that's going to be acceptable and usable. In an entirely different
context I recently picked up a phrase, "Oh, yes, there's a need
but there's not a want." And until that want is recognized,
perhaps created, the need may be irrelevant.

I do want to congratulate the group for, first of all, recognizing
the problem. I think nobody expected that we would have all the
answers at the end of 2 days, but I think it has been, to me at
least, a very useful discussion and it's very gratifying to me to
see the approach and the attack that's being taken. There are two
or three things that I think need to be given some attention. A
number of you, I suspect, heard Harold Gulliksen's invited address
at the last APA meeting. The approach he was describing at that
time is really one of reversing the predictor-criterion priority
and recognizing that sometimes when you get low relationships between
predictors and criteria the answer may very well be to examine the
criterion, because you frequently know much more about the predictor;
you have a much better understanding of what it really is than you
do for the criterion. The classic example of course is the one
that he used: the early [illegible] experience in which, much to some
of the psychologists' surprise when they began to validate the Navy
classification test against performance in training, the discovery
that for a Naval Machinist Mate, the highest validity was for a
verbal test and one of the lowest was mechanical aptitude, and yet
the task was very clearly primarily that of a mechanic. Well, the
simple solution, of course, would have been to accept the criterion
and to utilize the obtained validities as a method of selecting
people who would do well in the course. But fortunately somebody
said this really doesn't make too much sense, let's look at the
criterion, which was course grades. And the grades were on a
written examination based entirely on lectures and textbook
material. And the individuals in the course said they never had
their hands on anything that resembled a piece of armament. So
this analysis led to developing what was called the Breech Block
Assembly Test, which involved actual disassembly and reassembly of
a mock-up of a part of a naval gun. And lo and behold when that
was used as a criterion for success in the course, the mechanical
test had a fair amount of validity and the verbal aptitude test
dropped considerably. I think the principle which it illustrates
is that you can sometimes get a great deal of information about the
criterion, or you can at least raise very meaningful questions
about the criterion, by looking at the relationship that known
predictors have with it.

A somewhat related topic is that of the effect of the criterion on
the training. Again, this is not really aimed at solving the
criterion problem, but in dealing with the criterion problem I
think one has to recognize that the criterion may become a primary
determiner of training, or at least of learning, from the standpoint
of the student. The naval example I just gave may be a case in
point. In our own area the example we always come back to is that
of essay testing versus objective testing for writing (composition).
I think both the College Board and ETS, at least the vast majority
of the staff, would take a very strong position that from a
measurement standpoint, with a given period of time and a given
cost, there is no reason to use anything other than objective
measures of writing for measurement of writing ability. On the
other hand, it is pretty clear that as long as an objective test,
the marking of blanks on the answer sheet, is the only thing that
seems to be being evaluated, it's pretty hard for teachers to
spend the time in grading and collecting essays, and it's pretty
hard to convince students that they should do anything other than
learn some of the techniques of taking objective tests. So from
time to time the College Board has been convinced that, not for
measurement reasons but for educational purposes, the criterion
has to take the form of actually getting students to do some
writing. In some of my overseas experiences in looking at the
situations in other countries, it's very clear that the criterion,
the final examination, so determines the curriculum that in many
cases the purposes of education are rather completely subverted.
So all I'm saying is that in dealing with the criterion, its effect
on the whole learning and training process needs to be kept in mind.

One last point has probably been implied in all that has been said.
I think there's some tendency to ignore basic considerations of
reliability of the criterion if it seems to be objective, if it
seems to be quantified, if it seems to be specific, and this is
not necessarily enough. And here I go back to my World War II
bombardier research experience where in all of the three flying
training categories, bombardier, pilot, and navigator, the
criterion that seemed to be the best and the most objective one,
where you can really get numbers that almost popped out at you, was
the average error of the students on bomb drops in training missions.
And it was really a rather horrifying discovery when people came
up with the fact that you couldn't predict the average error on
the odd missions from that on the even missions. The
reliability was essentially zero for as objective, as quantified a
criterion as one could find. We managed to do a slight follow-up
on this. What started out to be a very closely controlled
experimental class was, fortunately, disrupted by the Japanese
surrender. We got enough evidence, however, to indicate that in
measured bomb dropping performance, the
least important link in the whole chain was the bombardier.
The plane, the auto pilot, the degree of turbulence that day, the
actual pilot flying the plane, the bombsight; all of these things,
according to an analysis of variance, contributed more to the
average error than did the bombardier. So we need to take a hard
look at criteria even though they seem to have the highest possible
face validity and not be lulled into any sense of false confidence.
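
The odd-even check described in this example can be sketched in a
few lines. What follows is only an illustration under stated
assumptions, not the analysis actually run on the bombardier data:
it assumes each student has a list of per-mission errors in mission
order, correlates the odd-mission average with the even-mission
average across students, and applies the standard Spearman-Brown
step-up. The function names and the input layout are hypothetical.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences.

    Assumes both sequences actually vary; a constant sequence would
    make the denominator zero.
    """
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def odd_even_reliability(errors_by_student):
    """Split-half reliability of an average-error criterion.

    errors_by_student: one list of per-mission errors per student,
    in mission order (hypothetical layout for this sketch).
    Returns (r_half, r_full): the odd-even correlation and its
    Spearman-Brown estimate for the full set of missions.
    """
    odd = [mean(e[0::2]) for e in errors_by_student]   # missions 1, 3, 5, ...
    even = [mean(e[1::2]) for e in errors_by_student]  # missions 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    # Spearman-Brown step-up; undefined when r_half is exactly -1.
    r_full = 2 * r_half / (1 + r_half) if r_half > -1 else float("nan")
    return r_half, r_full
```

A value of r_half near zero, as reported in the example, means that
a student's standing on one half of the missions says almost nothing
about his standing on the other half, however objective the
individual numbers look.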

Dr. McCormick: I do not think anything in the papers we have heard here
in these past couple of days could be viewed as a quantum step in
criterion development, but I think there are some
overtones that do warrant some recognition and that offer at least
modest encouragement for the future. In the first place, I believe
I sense a seriousness of concern about this problem in the military
services that hopefully will provide the momentum for concentrated
attention on this problem, which is clearly a critical one in
connection with personnel research. In the second place, I believe
there are a few bits of wheat mixed in with the chaff that might
take root and develop into some new strains of criteria or
approaches to criterion development.

Although we intend to seek the Holy Grail of the ultimate criterion
of performance, we certainly should not bypass the operational
need for criteria of achievement in training as referred to by
Meyer in his discussion of instructional development systems and
as discussed by DeLeo in the paper by himself and Waters. I was
quite impressed by Fred Muckler's paper in which he referred to
the many facets of criteria from A to Z, or maybe from AAA to
[illegible]. I suspect that he must have lain awake many nights to
organize what I believe to be a very significant discussion of this
problem, in particular in crystallizing a number of ideas and issues
which have otherwise been lurking furtively in the background. I
think especially his listing of the criteria of criteria is one
that might well be posted on the walls of research offices (in much
the same manner that many homes used to have [illegible] on their
walls, such as "God Bless Our Home").

In winding up there are just a couple of points I might add. In
the first place, I would like to suggest some attention to the
notion of quality control. This is not a new notion, although it
has not been mentioned in the confab here. I think that quality
control as applied to human performance evaluation is something
that has some sort of relevance to the problems with which we
deal. And the techniques and approaches of the industrial
engineers in connection with quality control of physical products
and processes are ones that I think can well be applied to the
performance of people on their jobs.

And next I will reflect an admitted bias in suggesting that I
believe the military services should pursue the notion of what I
prefer to call job component validity, previously called synthetic
validity or generalized validity. This would require the
development, for a good-sized sample of jobs, of information about the
relationship between job components on the one hand and the human
characteristics of those performing the jobs on the other hand.
Such an analysis might offer the possibility of applying the
relationships so teased out to other jobs, thereby avoiding the
necessity of developing criteria for each and every job
classification.

In closing, I would like to say that I am really impressed by the
sense of commitment of the individuals who have presented papers
at this seminar in terms of their interest in the criterion
problem and also some of the notions that have been bandied about.
I would be surprised if, as a result of this seminar, there would
be any really earth-shaking results that would solve the criterion
problem for all time. At the same time, one would hope that this
seminar would at least result in the exchange of ideas regarding this
important problem to the extent that some 3, 5, or 10 years hence,
one would be able to look back and say that the development in this
area has been moved forward at least by a few steps because of
the organization of this particular symposium.

Dr. J. Campbell: A lot of excellent material has been presented in
the last 2 days and it's not easy to digest it all so soon. Also,
[illegible] members have stolen some of the thunder. However, let me
begin by describing impressions stimulated by the
discussion during the last 2 days.

One dominant impression that does strike me is one that I have
always related to classes that I teach in industrial/organizational
psychology. That is, if one considers the major groups of applied
psychologists in the United States who deal with problems like
this, the researchers who are most sensitive to such problems
and who seem to have [illegible] on their subtleties are the
military psychologists. On the basis of what's happened at this
conference, I don't see any reason to change that opinion.

A long time ago, in 1967, I went to a similar conference sponsored
by the Richardson Foundation. It took place in North Carolina,
it was attended by a number of industrial/organizational
psychologists, and it was on the criterion problem. In comparing
the discussions there with the discussions here, I must make the
judgment that the field has come a long way; at least the level of
conceptual understanding is much higher now than it was then.

In the same breath, I would like to say that the criterion
problem, as we have historically talked about it in this field, is
intractable. There is no solution to the "problem" and we should
all get away from the notion that a final answer will someday
present itself. However, one major reason the criterion problem
is insolvable is because of the way we traditionally have defined
it. For example, I would like to sentence Robert Thorndike to
40 years of computing factor matrices by hand for making the
distinctions between immediate, intermediate, and ultimate
criteria. The concept of the ultimate criterion has been the
bane of our existence and it should be stricken from our language.
There is no such thing. However, regardless of its label, we
seem to have striven in past years for something. What is
that something? My own guess of what's in everybody's mind is that
it's one kind of rating or one kind of measurement that will be
generalizable across all situations, at least in form if not in
content, and which will almost always yield high reliability,
relevance, and predictability. All this is in spite of the fact
that it is very reasonable to conclude that performance in
certain situations is at best not very predictable and at worst
probably random, and that no one is ever going to find a
predictable or even reliable criterion in such contexts. This is not
the fault of psychology and it is not the fault of applied
psychologists. It is simply the way certain jobs happen to evolve
in certain kinds of organizations. We may want to think about
changing the organization itself, so as to make performance more
predictable; but adopting the goal of finding predictable measures,
when no predictability or even reliability exists, has given us
terrible guilt feelings, and we make almost pathological responses
to the "problem" as a result. Therefore, one general conclusion
I would like to make is that we really should redefine the criterion
problem drastically and adopt a different way of thinking about
it that does not include things such as I just mentioned.

Dr. Brokaw made a statement early on which I would like to
re-emphasize. He said that perhaps what we should be aiming for,
if there's any one thing, is a useful strategy for arriving at
criterion measures. That is, what we need is an overall plan for
how to approach the development of a criterion, not a set of
specifications for the criterion.

A number of the following points have been mentioned already, but
I would like to consider them again briefly. First, with regard
to the problems of ratings, notice how easy it is to forget the
parameters one should not forget. Guion reminded me pointedly
that "Well, you must distinguish research versus operational
kinds of measures." I did forget to do so, and I am sorry.
Besides distinguishing between research criteria and operational
criteria that are actually for the purpose of appraising people,
there are also criteria that have as their main purpose maximizing
the usefulness of performance feedback. That is, these are
criteria that are appropriate for training and development
purposes, but which are probably not very useful for research or
appraisal. Just to state the obvious moral, it's very easy to
forget the purposes for which we are going to use a specific
measure. Such forgetfulness is an insidious disease. What is
the best way to inoculate ourselves against it? I don't know,
but we should keep trying.

Second, I would like to echo what Dick Campbell said about the
military's work on performance simulations. It is pretty exciting
stuff. Obviously when using simulations there are pitfalls that
must be faced at some point. For example, if there are truly
important decisions to be made about people on the basis of such
a measure, then you might find the same phenomenon that Dr. Helmick
just mentioned with regard to the educational setting. People
will start emphasizing the behavior measured by simulation and
not the job activities they had been concentrating on previously.
That is, if people are to be rewarded for high scores on a
particular kind of measure, that is what they will try to maximize.
We simply can't get away from B.F. Skinner. If simulation
continues to become a more widely used method for assessing
performance, then we really must worry about whether it is the
specific behaviors on the test that we want to emphasize.

Also, as I mentioned before, it is easy to slip into thinking that
performance assessment, whether it be for research criteria or for
any of the other purposes, takes place in a vacuum. Even if we
make the argument that a particular study is for research purposes
only, we still must worry about how people are cooperating and
how they actually respond to what we ask of them. We seldom go
back to people and say, "We asked you to participate in a research
project, is that what you really thought was going on? How did
you respond to the briefing? etc." Such an examination of the
research process could be very informative, if pursued.

Regarding Guion's comments about the economics of criterion
research, I would like to speak as one citizen (i.e., taxpayer).
It really is disconcerting that people back "there" will waste so
much money on so much else and then starve to death one of our
most important military manpower problems.

One curious thing I noticed about the last day and a half is that
the literature being cited wasn't very recent. I seldom heard a
date beyond the late 60's. The 1970's were mentioned very
infrequently. I'm wondering if that's because nothing happened
during that period, or it's not worth much, or what. It's just
an impression I have, and perhaps it is inaccurate.

Finally, switching from describing impressions to giving advice,
let me make two or three suggestions. One is that I agree with
Dr. Muckler that we really have to stop sounding so pessimistic.
We really know a lot more about criteria and the criterion problem
than we give ourselves credit for, and we ought to tell people
that. We shouldn't keep making ourselves look so bad. For
example, long before the discrimination question reared its head
in the selection domain, we talked ourselves into the notion that
we had to have perfect predictions in order to do our job right,
or at least correlation coefficients of .75 or better. It is not
surprising that when lawyers and the courts came along, they
looked in our textbooks and assumed that near perfect prediction
was possible, if only the psychologists would get their heads
together. As a result, it is now our fault that prediction isn't
perfect.

Something about which we didn't hear enough is that a criterion
measure directly reflects the values of the organization concerning
what individuals should be doing. By implication, those things
which are selected as criteria are those things which the
organization says are important for people to do. The criterion is the
variable of real interest. Now, what is the value system of the
organization? The obvious answer is that it is many things, and
there are those within the organizations who disagree. For
example, in your situation you might be putting together a measure
of pilot proficiency and there could be wide disagreement within
"management" as to whether a high score on a specific component
of proficiency is good or bad. ... any organization. If you carry
out the criterion development process correctly, you will involve
the users, management, and the rank and file, and concern yourself
with trying to find out their values and preferences for how high
and low performance should be defined. Such a process will most
likely uncover serious conflict. Certainly that is the case in
educational institutions like large universities. It is legitimate
conflict about what behaviors and accomplishments are valuable
to the organization, and which are not. I think we have to program
into our criterion development activities some more systematic
procedure for confronting such value conflict and dealing with it.



One kind of research that I personally would like to see conducted
more frequently by people in the military and elsewhere has to do
with more applied investigation of the judgment process itself.
Dr. Guion mentioned a little while ago the notion of "true" score
on a performance dimension to be rated. Some associates of mine
in Minneapolis (Borman, 1978), under the sponsorship of ARI,
conducted a study in which they tried to program the performance
of the people to be rated as precisely as they could. That is,
people were given scenarios of behavior episodes that illustrated
examples of high, medium, and low performances on various
dimensions. By careful rehearsal of the "actors," the experimenters
tried to establish a true score for performance at various levels.
That is, they were trying to set up a situation where the
performance of an individual on each of the factors was known. The
questions to be investigated concern what the observers do with the
performance information. Do they make large errors? Are errors
systematic or random? What kind of systematic errors are present?
What method of assessment yields the smallest error? The behavior
episodes used by Borman were rather brief, which perhaps was the
study's main drawback; nevertheless, I think the paradigm could be
applied in many different contexts. However, it should not be
translated into broad survey format. We have had enough of that.
The method would quickly lose its fidelity there. In toto, I think
we could learn a lot about what's going on in the performance
judgment situation by a more intensive look at the actual processes
that take place.
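
The questions raised above (how large are raters' errors, and are
they systematic or random?) can be made concrete with a small
calculation. The sketch below is an illustration only and is not
taken from Borman's study. It assumes hypothetical scripted "true"
scores and one rater's ratings on the same scale, and it splits the
rater's error into a constant part (the mean signed error) and a
variable part (the spread of the errors around that mean). The
function name and the data are assumptions made for the example.

```python
from statistics import mean, pstdev


def rating_error_summary(true_scores, ratings):
    """Summarize one rater's errors against known (scripted) true scores."""
    if len(true_scores) != len(ratings) or not true_scores:
        raise ValueError("need two non-empty lists of equal length")
    # Signed error for each rated performance: rating minus true score.
    errors = [r - t for r, t in zip(ratings, true_scores)]
    return {
        # Systematic component: positive means leniency, negative severity.
        "constant_error": mean(errors),
        # Random component: scatter of the errors around the constant error.
        "variable_error": pstdev(errors),
        # Overall size of error, ignoring direction.
        "mean_abs_error": mean(abs(e) for e in errors),
    }


# Hypothetical data: scripted performance levels and one rater's ratings.
scripted = [2, 4, 6, 2, 4, 6]
observed = [3, 5, 7, 3, 4, 7]
print(rating_error_summary(scripted, observed))
```

A rater with a large constant error but a small variable error is
consistently lenient or severe, which is a different problem from a
rater whose errors scatter unpredictably.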

Also, I don't think we've done enough with ...
methods for sampling task behavior. We've too quickly ...
a consideration of how well people can rate ...
I'd like to go back to more research on how best to ...
sampling and describing.



Let me leave you with a very non-traditional question for our kind
of applied psychologist. What would happen if we put on our B.F.
Skinner, Inc. hard hat and routinely did an operant type functional
analysis of every performance assessment situation that we enter?
What rewards and punishments control the behaviors of the subjects
and, ultimately, the raters? What rewards and punishments control
the behavior of the sponsors of the research? In general, what
are the reinforcers and the reinforcement contingencies that
control the entire criterion development and performance
measurement system? Without a clear understanding of such relationships,
criterion research often must swim upstream. It would be better
to go with the flow, so to speak.




COMMENTS ON SYMPOSIUM ON CRITERION DEVELOPMENT
FOR JOB PERFORMANCE EVALUATION



John S. Helmick



I appreciated the opportunity to participate in the criterion
symposium and found it a stimulating experience. I congratulate the
Air Force for its recognition of the importance of this problem and
its straightforward approach to trying to deal with it.

I also want to commend the sensitivity expressed by several
individuals to the importance of sponsor acceptability and user
requirements in making applied research effective. The communication
between the researcher and the user seems to me to be one of the most
critical features in being sure that applied research is actually
applied. This should be a major concern from the initial definition
of the problem to the preparation of the final report.

When to Measure the Criterion

Perhaps the major problem in the whole area is that of determining
where in the time frame to attempt to define and measure the criterion.
Should the criterion be a measure at some point in training or on the
job? If the latter, should it be initial performance or later
performance? Almost inevitably the accuracy of prediction decreases as
the time span between its measurement and the criterion measurement
increases, yet the importance of the criterion increases. While some
of the intervening variables between prediction and later performance
can be anticipated and accounted for in prediction, in general they
cannot be. To deal with this we need a better understanding of the
chain of events between initial and final measures and as much
knowledge as possible of their interrelationships. One of the papers
suggested that the differentiation between predictor and criterion
was essentially in the time at which each was measured and supported
the procedure of successive measurement. While I am not willing to
accept the principle that the criterion does not have a kind of meaning
different from that of the predictor, I agree that attempts to
differentiate them on some simple all-encompassing basis lead to
difficulties. The answer, if any, to the problem lies in understanding a
network of successive measures taken throughout the time span, and this
really implies that one must determine and understand the underlying
psychological principles that relate antecedent to consequent, if not
actually cause to effect. While this ideally calls for longitudinal
study, a series of short-term, almost cross-sectional, studies may
yield satisfactory approximations. This suggests that satisfactory
criterion research may very well be basic psychological research and
that work that does not recognize this may have significantly less
general payoff in the long run. In this connection the distinction made
between evaluation and measurement should be kept in mind. A
satisfactory definition of criterion performance does require value judgment.

Restriction of Range



In addition to this somewhat philosophical general statement, there
are a number of brief comments that may be worth noting. A number of
participants pointed out that the restriction of range after training
does present a problem. This can become an even greater problem if
the predictor measures are used as a basis for compensatory training.
If effective, this will produce a self-defeating prophecy, making the
predictor seem less useful than it really is. The suggestion should
be pursued that one look for differences among individuals considered
successful by the traditional criteria already in use. Some of these
differences may provide a basis for new criterion development.
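
The effect of restriction of range on a validity coefficient can be
illustrated numerically. The commentary gives no formula; the sketch
below uses the standard correction for direct selection on the
predictor (commonly called Thorndike's Case II), and it assumes that
the predictor's standard deviation is known both in the full applicant
group and in the selected group. The function name and the numbers are
assumptions made for the example.

```python
from math import sqrt


def correct_for_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Estimate the unrestricted validity from a range-restricted one.

    Standard correction for direct selection on the predictor:
        R = r * k / sqrt(1 - r**2 + r**2 * k**2),  k = SD_full / SD_selected
    """
    if not -1.0 <= r_restricted <= 1.0:
        raise ValueError("correlation must lie between -1 and 1")
    if sd_unrestricted <= 0 or sd_restricted <= 0:
        raise ValueError("standard deviations must be positive")
    k = sd_unrestricted / sd_restricted
    r2 = r_restricted ** 2
    return (r_restricted * k) / sqrt(1.0 - r2 + r2 * k ** 2)


# Hypothetical case: validity of .30 observed in a selected group whose
# predictor spread is half that of the full applicant group.
print(round(correct_for_range_restriction(0.30, 10.0, 5.0), 3))
```

When the two standard deviations are equal the function returns the
observed correlation unchanged; the narrower the selected group, the
more the observed coefficient understates the predictor's usefulness.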

Reverse Validity

The concept of "reverse validity" is worth pursuing. This is
simply the recognition that predictor measures are often better
understood than the criteria, and hence high or low relationships may
provide insight into the nature of the criteria. This contrasts with
the usual procedure of accepting the criterion as the given and judging
the predictors on that basis.

Unreliability 

Considerable attention was given to the unreliability of criteria,
particularly when they take the form of ratings, essays, oral
examinations, and other admittedly subjective judgments. This is all to the
good, but it should not allow one to assume that because quantitative,
apparently objective, accurately measured performances are used that
reliability will be automatic. The unreliability of error measures in
practice bomb dropping during World War II cadet training is a case in
point.

Effect on Training

In searching for and introducing criterion measures, it is
important to be alert to their effect on the training process. This is
especially true if the criterion involves sampling a relatively small
number of all those items which should be included in training. In
such a case the inevitable tendency is to gradually concentrate the
training on only those items that are to be evaluated.




It seems worthwhile to pursue the measurement of speed as a way
of dealing with the criterion. I see this as largely a matter of a
different way of measuring the criterion rather than producing a truly
different criterion. One still has to make the judgment about and
decision on what behavior will determine the measurement
of time consumed.

Group Performance

It was recognized that some of the examples described really
involved performance of a group or system rather than that of a single
individual. It seems desirable to keep these two types of performance
separate. While individual performance can frequently be aggregated
to provide a group measure, it is likely that in many cases the group
performance will require some separate measurement of group variables.

Cognitive Emphasis

One final note. I was struck with the continuing emphasis in many
of the presentations on cognitive and intellectual variables. I would
not want to underestimate their importance and, as one who's been
involved in work almost entirely concerned with such variables for
many years, I recognize the much greater ease of measuring them.
Nevertheless I think we can continue to be lulled into false feelings
of success by putting too much weight on such measures.




A pervading theme of this symposium has been the need for
concentrated effort directed toward "doing something" about the criterion
problem in personnel research in the military services, with particular
focus on the measurement of on-the-job performance.

Although I think nothing in the papers we have heard could be
viewed as a "quantum leap" in the "solution" to this problem, I think
there are overtones that I believe do warrant some recognition,
and that offer at least modest encouragement for the future. In the
first place, although I may be a bit Pollyannish, I believe I sense a
seriousness of concern about this problem in the military services that
hopefully will provide the momentum for concentrated attention on
this problem, which is clearly a critical one in connection with
personnel research. And in the second place, I believe there are a
few bits of wheat mixed in with the chaff that might take root and
develop into some new "strain" of criteria or approaches to criterion
development. Let me now touch perhaps a bit randomly on a few of the
points that were made in the papers that seem to me to be of particular
interest.

To begin with, Mullins and Ratliff in their discussion of the
"Criterion Problems" emphasize the point that the best predictor of
future achievement is some indication of past achievement. (This
theme, of course, has been expressed by various people, including
Wernimont and Campbell in their paper "Signs, Samples, and Criteria.")
Following along this line, they raise the question as to whether there
is really any difference between predictors and criteria, since both
are measures of achievement of some type. I have a great deal of
sympathy with this point of view, since predictors are measures of
some type of achievement. But granting the basic thesis that
predictors and criteria both are measures of achievement (that is, that
they do not differ in their natures), I believe that at least in many
circumstances they do differ substantially in degree, particularly
the degree of complexity. In other words, I believe that criteria
generally are harder nuts to crack than predictors.

I was interested in the listing by Weeks and Mullins (in their
paper on "Rater Accuracy") of the basic dimensions of the rating
paradigm, these being: (1) the rater, (2) the person rated, (3) the
traits or tasks to be rated, (4) the social environment, and (5) the
physical environment. I believe all of these warrant systematic
investigation as sources of possible variance (error and otherwise)
in criterion development. I support their proposal to
explore some of the problems associated with raters. It would be
particularly useful to be able to identify those individuals who can
serve as good raters and also to explore the extent to which training
of raters can improve their performances. (In a study we have just
completed it was found that even moderate training of raters had
some beneficial effect upon the ratings made by them.)



Aside from the factors which they mentioned, however, I believe
there is another area that warrants substantial attention, and that
relates to the type of "rating" procedure that is used. Curton,
Ratliff, and Mullins in their paper "Content Analysis of Rating
Criteria" do in fact refer to this matter, in particular by referring
to the use of behaviorally anchored scales as contrasted with
conventional rating scales. However, there may be other approaches to
the development of criteria that might also be subject to some
comparative analysis. Other types of rating procedures of course include the
forced choice method, the weighted checklist, the various personnel
comparison systems (such as rank order, paired comparison, and forced
distribution), and the critical incident technique. Actually most of
these methods of obtaining personnel appraisals differ in the nature
of the human responses that are required. For example, the forced
choice checklist and the weighted checklist depend pretty much on
the "description" of behaviors rather than making evaluative judgments
about behavior. In turn, the conventional rating scale requires the
making of absolute judgments as contrasted with the relative judgments
that are required by the personnel comparison systems. Various
experimental studies dealing with judgments people can make about
physical stimuli indicate that people are much better in making
relative judgments than in making absolute judgments. I thoroughly support
Christal's argument for the use of relative judgments in at least many
circumstances where human judgments must be used.



Along this line, the suggestion made by Mullins and Weeks in
their paper "The Normative Use of Ipsative Ratings" is a rather
intriguing one. In addition, I might refer to some of the notions
that were suggested back a few years ago in this same hotel relating
to performance evaluation of Air Force officers. A number of rather
ingenious suggestions were made at that time that might be further
explored in connection with their relevance in developing criteria.

Thus, I would urge further comparative studies of different
methods of evaluating the performance of individuals, including
comparisons of different "types" of human judgments, both as related to
their psychometric properties and to their practical differences, as
referred to by Curton, Ratliff, and Mullins.



However, we must recognize that no rating procedure is going to
compensate for poor human judgments. Granting this ubiquitous fact,
however, we should try to find out as much as we can about the processes
of making human judgments, toward the end of the development of the
"rating procedures" that provide the best opportunity for eliciting the
best judgments people can make. A public pronouncement of a
mid-career shift to investigate the processes of making human judgments
is indeed an encouraging sign.

The development and use of on-the-job sample tests, or what Foley
refers to as performance measurement (PM), certainly deserves some place
in the military system. There is probably no question that the use
of such tests can provide a reasonably adequate basis for the derivation
of criterion values and performance measures of individuals. The
problem with respect to this, as we all know, is that of time
and cost. I presume the basic problem here is one of somehow determining
those areas and types of jobs for which this time and cost would be
cost effective if such tests were used, as contrasted to those areas
where it would not be cost effective. Because of the practical problems
of cost and time involved in development and use of job sample tests,
however, I would urge further exploration of the "simulation" of such
tests as Foley suggested, and of the extent to which "sampling" the
performance of various aspects of the job can produce criterion values
that may approximate measures of performance on the total job. The
theme of simulation and sampling was also referred to by Mullins,
Ratliff, and Earles in their paper on "Synthetic Criteria." If they
confirm the finding that their "R-technique" and "M-technique" provide
the basis for deriving estimates of performance that are strongly
correlated with actual performance, such a route is one that should be
pursued.

Although we tend to seek the "Holy Grail" of the ultimate criteria
of performance, we certainly should not bypass the operational need for
criteria of achievement in training, as referred to by Meyer in his
discussion of Instructional Development Systems (IDS), and as
discussed by DeLeo in the paper by him and Waters.


I was very much interested in Dr. Christal's suggestion regarding
the use of "time" as a criterion, with variations in terms of the
speed of acquisition, decay, and reacquisition. I personally feel
that the time taken to learn something is, at least on rational grounds,
an indication of learning ability, and feel that efforts to use time as
the basis for establishment of criteria might well warrant
attention. In this regard, however, I might comment on one possible
problem, and that is the problem of determining the point in time at
which a person's performance or acquisition of skill has reached a
"satisfactory" level. (Although time might then be a "..."
measure, the use of this does not completely avoid the need to make
some determination as to the "level" of performance of individuals.)



As a sideline comment about time, I might add another point, that the 
stage at which a person initiates his learning presumably is an important 
factor in the time taken to achieve some previously determined level of
proficiency. This matter has been rather thoroughly explored by 
Stanley Lippert, to the point that he has derived an "equation" for 
taking into account the stage of skill at which the person starts 
training, and has found that this improves very significantly the 
prediction of the future learning of the individual. 

I was quite entranced by Fred Muckler's paper in which he covered 
the many facets of criteria from A to Z, or perhaps from AAA to ZZZ. I 
suspect he must have lain awake many nights pulling together and 
organizing what I believe to be a very significant discussion of this 
problem, in particular in crystallizing a number of points and issues 
which have otherwise been lurking furtively in the background. I think
especially his listing of criteria for criteria is one that might well
be posted on the walls of research offices in much the same manner that
many homes used to have framed mottos on the wall such as "God Bless
Our Home."

Before I close I would like to add three additional reflections.
In the first place, although ratings have been thoroughly maligned
many times over (and certainly with some justification), there are at
least a couple of factors that will probably cause them to be with us
for a long time. There are some aspects of human behavior for which
human judgments probably are the most appropriate basis for evaluating
performance. Furthermore, there are some aspects of human performance
that conceivably should be evaluated on the basis of some "objective"
measures, but for which we have not been bright enough to figure out
adequate methods of measurement. In such instances the basic problem
may be one of figuring out the best way of obtaining reliable and
valid judgments, rather than being overly obsessed with the notion of
obtaining "objective" measures of performance.

In the second place, I would like to suggest further attention to 
the notion of "quality control" as applied to human performance
evaluation. This is not a new idea, of course, but I believe it has
some further relevance to the criterion problem. 

And in the third place (in which I will reflect an admitted bias),
I believe the military service should pursue the notion of what I
prefer to call job component validity (previously called synthetic
validity). The development, for a good sized sample of jobs, of a
solid data base characterizing the relationship between job components
on the one hand, and human requirements for performing the activities
involved in them on the other hand, might offer the possibility of
applying the relationships so teased out to other jobs, thereby
avoiding the necessity of developing criteria for each and every job
classification.

In closing I will say that I am really impressed by the sense of
commitment of the individuals who have presented papers at this
seminar in terms of their interest in the criterion problem and also
at some of the notions that have been bandied about. I would be
surprised if, as a result of this seminar, there would be any real
earth-shaking results that would "solve" the criterion problem for all
time. At the same time, one would hope that this seminar would at
least result in the exchange of ideas regarding this important
problem to the extent that some 3, or 5, or 10 years hence one would
be able to look back and say that development in this area has moved
forward by at least a few steps since this time.